SlideShare a Scribd company logo
1 of 23
Mining and Analyzing Online Social Graph Data


                     Drew Conway

            New York University, Dept. of Politics



                     May 13, 2010
Agenda



  Network basics
         Unit of analysis
         Data representation
         Analysis & Visualization
  Thoughts on research design
         Edge contexts
         A bit on social graph data and ethics
  Digging into the data
         Where to get it
         Our first scrape and build!
  Real-time social graph analysis
         Build the network of Twitter users in the room
         Hold our breath...
Live demo setup




   Last week Bruno asked how many of you use Twitter
       15 (54%) members said yes, which may be enough to do some interesting
       stuff
   Please tweet the following hash-tag:

                                 #analyticsnyc
   Some example tweets:
       Many thanks to the @GiltGroupe for hosting tonight’s #analyticsnyc
       meetup
       Hanging out with @DKALab and @brunocm at #analyticsnyc
       I cannot wait to watch @drewconway crash and burn with his
       #analyticsnyc live demo
Network basics
                                                    The study of relationships
                                Research design
                                                    Data representation
                               Digging into data
                                                    Analysis & Visualization
                             Live demonstration

Networks and the study of relationships


   Network theory use the language of graph theory
                                      G      =     {V , E }
                                                         ↑
   In the abstract this is very powerful, we have a general purpose way to
   represent any number of relations
       G={Routers,Packet Traffic}
       Edges have non-binary values
       G={U.S. Airports, Commercial Routes}
       Edges can represent distance, cost, frequency, etc.
       G={New York City nerds, Co-membership in Meetups}
       Edges are implied!
   While both nodes and edges are needed to have a network of any substance, it
   is the edge (relationship) that will always be the primary focus of our analyses



                                  Drew Conway       Mining and Analyzing Online Social Graph Data
Network basics
                                                     The study of relationships
                               Research design
                                                     Data representation
                              Digging into data
                                                     Analysis & Visualization
                            Live demonstration

Why focus on the edge?

   Consider a very simple example of two people meeting for the first time...

                                                  In reality, the meeting may reveal a large
 If we focus on the nodes, we might               degree of shared structure:
 consider this meeting creates a
 dyad:




 We know, however, that people do
 not exist as isolates and this
 assumption is ignoring all of                    By expressing this meeting in terms of the
 exogenous social structure these                 nodes as a function of their edges we may
 individuals brings with them.                    gain a much richer understanding of the
                                                  structural dynamics

                                 Drew Conway         Mining and Analyzing Online Social Graph Data
Network basics
                                                      The study of relationships
                                   Research design
                                                      Data representation
                                  Digging into data
                                                      Analysis & Visualization
                                Live demonstration

Representing network data: the canonical

   Perhaps the most natural way to represent the relationships between N actors
   is with an NxN matrix, often referred to as a “sociomatrix”
        X1    X2    ...    XN
  X1    0     1     ...     0                 Very intuitive representation
  X2    1     0     ...     1                 Can support directional and weighted edges
  .
  .      .
         .     .
               .    ..      .
                            .
  .      .     .       .    .                 Application of matrix algebraic operation for
  XN    0     1     ...     0                 analysis

   As an introduction to representing relationship, a matrix provides insight to a
   network by itself. In practice, however, this representation has many limitations
       Unwieldy as network sizes scale up
       Most networks are sparse; too many zeroes
       Difficult to publish and share

   Fortunately, there are many other options for representing network data!


                                     Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                                                      The study of relationships
                                   Research design
                                                      Data representation
                                  Digging into data
                                                      Analysis & Visualization
                                Live demonstration

Representing network data: the practical
      Remember, it is all about the edges; therefore, more efficient representations
      will be limited to edge data

                   Edge list                                                Adjacency list

 A text file with two columns (usually                 Also a text file, however, here the first
 delimited by a space or tab), where first             column is the source, and all
 column in source, and second is target               subsequent entries are “adjacent”
                                                      nodes
 1    2
 1    3
 2    1
 6    7
 6    8                                               1    2     3
 10   12                                              2    1
 10   13                                              6    7     8
                                                      10   12    13


      While these are the most universal data formats, network data representation is
      a bit of a cottage industry
           Pajek (.net)
           GraphML (.gml)
           GraphViz (.dot)
           ..and domain specific formats (eek!)
                                     Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                                                    The study of relationships
                                 Research design
                                                    Data representation
                                Digging into data
                                                    Analysis & Visualization
                              Live demonstration

A bit on tools
   The number of software suites and packages available for conducting social
   network analysis has exploded over the past ten years
       In general, this software can be categorized in two ways:
            Type - many SNA tools are developed to be standalone applications, while
            others are language specific packages
            Intent - consumers and producer of SNA come from a wide range of
            technical expertise and/or need, therefore, there exist simple tools for data
            collection and basic analysis, as well as complex suites for advanced research

                   Standalone Apps                                 Modules & Packages
                   - ORA (Windows)                     - libSNA (Python)
     Basic         - Analyst Notebook (Windows) - UrlNet (Python)
                   - KrakPlot (Windows)                - NodeXL (MS Excel)
                   - UCINet (Windows)                  - NetworkX (Python)
     Advanced - Pajek (Multi)                          - JUNG (Java)
                   - Network Workbench (Multi)         - igraph (Python, R & Ruby)
   Many of the above tools have visualization components, but several tools are
   designed specifically for visualization: Graphviz, NetDraw, Tom Sawyer, Gephi,
   etc.
   What I use
                                   Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                                                    The study of relationships
                                 Research design
                                                    Data representation
                                Digging into data
                                                    Analysis & Visualization
                              Live demonstration

Comparing two network metrics to find key actors

   Often network analysis is used to identify key actors within a social group. To
   identify these actors, various centrality metrics can be computed based on a
   network’s structure
       Degree (number of connections)
       Betweenness (number of shortest paths an actor is on)
       Closeness (relative distance to all other actors)
       Eigenvector centrality (leading eigenvector of sociomatrix)
   One method for using these metrics to identify key actors is to plot actors’
   scores for Eigenvector centrality versus Betweenness. Theoretically, these
   metrics should be approximately linear; therefore, any non-linear outliers will be
   of note.
       An actor with very high betweenness but low EC may be a critical
       gatekeeper to a central actor
       Likewise, an actor with low betweenness but high EC may have unique
       access to central actors


                                   Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                                                   The study of relationships
                                Research design
                                                   Data representation
                               Digging into data
                                                   Analysis & Visualization
                             Live demonstration

First, visualize the data
   Visualization can be the best first step in the analytical process

                                                                        This will give you a good
                                                                        feel for what is going on
                                                                        with your relationships.
                                                                        For large networks this is
                                                                        often not possible
                                                                        For this example, we will
                                                                        use the main component
                                                                        of the social network
                                                                        collected on drug users in
                                                                        Hartford, CT. The network
                                                                        has 194 nodes and 273
                                                                        edges.
                                                                        Here I am using the
                                                                        GUESS visualizer in NWB
                                                                        with a Kamada-Kawai
                                                                        layout

                                  Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                                                       The study of relationships
                                    Research design
                                                       Data representation
                                   Digging into data
                                                       Analysis & Visualization
                                 Live demonstration

Finding Key Actors with R

   The first steps are to load the data into memory, and perform some basic
   centrality analysis1

   Load the data into igraph
   library(igraph)
   G<-read.graph("drug_main.txt",format="edgelist")
   G<-as.undirected(G)
   # By default, igraph inputs edgelist data as a directed graph.
   # In this step, we undo this and assume that all relationships are reciprocal.




   Store metrics in new data frame
   cent<-data.frame(bet=betweenness(G),eig=evcent(G)$vector)
   # evcent returns lots of data associated with the EC, but we only need the
   # leading eigenvector
   res<-lm(eig~bet,data=cent)$residuals
   cent<-transform(cent,res=res)
   # We will use the residuals in the next step



     1
         Weeks, et al (2002) http://dx.doi.org/10.1023/A:1015457400897
                                      Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                                                                     The study of relationships
                                      Research design
                                                                     Data representation
                                     Digging into data
                                                                     Analysis & Visualization
                                   Live demonstration

Finding Key Actors with R

                                                                                   Key Actor Analysis for Hartford Drug Users
                                                             1.0                                                                                                           44

 Plot the data                                                                                                                                                                                     res

 library(ggplot2)                                            0.8                                                                                                                                         −0.2
 # We use ggplot2 to make things a                                                                                                                                                                             0
 # bit prettier                                                                                                                                                                                               0.2
 p<-ggplot(cent,aes(x=bet,y=eig,                                                                                                    28
                                                                                                                                                    53                                                        0.4
     label=rownames(cent),colour=res,                        0.6                                                                                                                                              0.6
     size=abs(res)))+xlab("Betweenness




                                                    Centrality
                                                  Eigenvector
     Centrality")+ylab("Eigenvector                                               50
                                                                                                                                                                                                   abs(res)
     Centrality")
                                                                                                                                                                                                         0.1
 # We use the residuals to color and
                                                             0.4                                                                                                                                         0.2
 # shape the points of our plot,                                   141
                                                                                                                      47

 # making it easier to spot outliers.                                        20                                                     56
                                                                                                                                                                                                         0.3
                                                                   155
                                                                   58                      43
 p+geom_text()+opts(title="Key Actor                                                     100
                                                                                         22                                                                   126                                        0.4
                                                                                                                    57
                                                                   147
     Analysis for Hartford Drug Users")                                  23
                                                                            140                    18                                                                                                    0.5
                                                                                               104      19
 # We use the geom_text function to plot                     0.2    75 65
                                                                     84
                                                                   139
                                                                   41                                                                                9
                                                                                                                                                                                                         0.6
                                                                   13685
 # the actors’ ID’s rather than points                              48
                                                                   17                   87
                                                                                       55          54

                                                                    63 89105
                                                                        94 77
                                                                                                                    184
                                                                                                                                                                                              67
                                                                                                                                                                                                         0.7
 # so we know who is who                                             101
                                                                           64                165
                                                                                                                          93

                                                                    3349                                                                 162
                                                                   86                                                          91
                                                                     177
                                                                   170 21
                                                                   36168
                                                                   174
                                                                   52    111            27              16
                                                                   42       68
                                                                    6      12     69
                                                                         148
                                                                   169 149                                           78
                                                                   183
                                                                   181
                                                                   110 172
                                                                   25
                                                                   185
                                                                   182
                                                                   150       61              11         176
                                                                                                              130
                                                                                                                                     45                                                  79
                                                                                    138 66 62
                                                             0.0
                                                                   124
                                                                   26
                                                                   164 121
                                                                   163
                                                                   189
                                                                   179
                                                                   152
                                                                   119 127 38 88 151
                                                                   133
                                                                   161
                                                                   166
                                                                   107276114399 46 5 97 608 118 4
                                                                   120 240 137
                                                                   109 107159 157144 73 116
                                                                    51 95 90
                                                                   135
                                                                   186 190 123
                                                                   187
                                                                   156 173 192
                                                                   146
                                                                    96
                                                                   122 82
                                                                   125 143
                                                                   134 37132142 72
                                                                    13
                                                                   112 71
                                                                   167
                                                                    31
                                                                   153
                                                                   145
                                                                   128
                                                                     129
                                                                   188 193
                                                                     191
                                                                    15
                                                                   171
                                                                    92
                                                                    39
                                                                    13
                                                                   180
                                                                   158
                                                                   175 178 106
                                                                   108
                                                                   194
                                                                   113
                                                                    98
                                                                    81
                                                                    59
                                                                    83
                                                                    80
                                                                    74103                 154 24               117 70
                                                                                                         160 115
                                                                                                          34
                                                                                                          131                       14
                                                                                                                                    30         29        35                       102



                                                                   0                          1000                             2000             3000                4000   5000   6000
                                                                                                                               Betweenness Centrality




                                        Drew Conway                  Mining and Analyzing Online Social Graph Data
Network basics
                                                                                                                                                                                                                                                                                                                                                                                                               The study of relationships
                                                                                                                                                                                                                                                                            Research design
                                                                                                                                                                                                                                                                                                                                                                                                               Data representation
                                                                                                                                                                                                                                                                           Digging into data
                                                                                                                                                                                                                                                                                                                                                                                                               Analysis & Visualization
                                                                                                                                                                                                                                                                         Live demonstration

Key Actor Plot
                                                                                                                                                                                                                         q




                                                                                                                                                                                                                                                                                     q

                                                                                                                                                                                                                                                                                                         q




                                                                                                                q
                                                                                                                                                                                                                                                 q




                                                                                                                                                                                                                                                                     q




                                                                                                                                                     q
                                                                                                                                                                                         q


                                                                                                                                                                                                                                                                                                             q                q




                                                                                                                                                                                                                                                                                         q

                                                        q
                                                                                                                                                                                                                                                                                                                                                                                                                   q


                                                                                                                                                                                                                                         q




                                                                                                                                                                                                                                                                             q
                                                                                                                                                                             q
                                                                    q



                                                                                                                                                                                                             q




                                                                                                                    q                                                                                                                                                                                            q
                                                                                                                                                                                                                                                                                                                                          q
                            q                                                                                                                    q



                                                    q
                                                                                    q                                                                                                                                                                                        q                               q                                        q            q
                                                                                                                                                                                                                                                                                                                                                                               q
                                                                                                                                                                                                                         q




                                                                                                                                                                                                                                                                                                                                                      141
                                                                                                                                                                                                                                                                                                                                                                                                                           q




                            q
                                                                                                        q                                                                                                                                                                        q
                                                                                                                                                                                                                                                                                                                                                  q        155                                                                                                                 Network plot
                                                                                                                                                                                                                                                                                                                                                           q
                                                                                                                                                                                                                                                                                                                                                                                                                                       q




                                                                                                                                                                                                                                                                                                                                  q
                                                                                                                                                                                                                                                                                                                                                                       q
                                                                                                                                    q



                                q                                       q
                                                                                                                                                                                                                                                                                                      58 47                                   44                                               q
                                                                                                                                                                                                                                                                                                     q q                                                                                                                                                                       # Create positions for all of
                                                                                                                                                                                                                                                                                                                                   q                  28
        q                                                                                                                                                                                                                                                                    q

                                                                                                        q
                                                                                                                                                                                                                                                                                                                             q                                                                                                 q
                                                                                                                                                                                                                                                                 q


            q                                   q
                                                                                                                                         q


                                                                                                                                                                                                                 q
                                                                                                                                                                                                                                                                                             q                                                                     q


                                                                                                                                                                                                                                                                                                                                                                                       q


                                                                                                                                                                                                                                                                                                                                                                                                                                                                   q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               # the nodes w/ force directed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               l<-layout.fruchterman.reingold(G,
                                                                                                                                                                                                                                                                                                 53
                                                                                                                                                                                                                                                                                         q
                                                                                                                                                                                                                                                                                                                                                  q                                                        q



                                                                            q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                           q

q



                        q                                                                           q                                                                                                                                                                    q                                                            q                                                                                                            q

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   niter=500)
                                                                                                                                                                                                                                                                                                                         20                                 50
                                                                                                                                                                                                                                                                                                                                                           q
                            q

                                                                                                                                                                                                                                                                                         q
                                                                                                                                                                                                                                                                                                                         q                                 q                                                                                                                   # Set the nodes’ size relative to
 q
                                                                q                                                                                            q

                                                                                                                                                                                                                                                                                                                                  q
                                                                                                                                                                                                                                                                                                                                          q
                                                                                                                                                                                                     q




                                        q


                                                                                                                    q
                                                                                                                                                                                                                                                         q
                                                                                                                                                                                                                                                                                                 q
                                                                                                                                                                                                                                                                                                                                                                   q
                                                                                                                                                                                                                                                                                                                                                                                   q




                                                                                                                                                                                                                                                                                                                                                                                                                                                                               # their residual value
                                                                                            q
                                                                                                                                                                                                                     q
                                                                                                                                                                                                                                                             q




                                                                                                                                                                                                                                                                                                                         q
                                                                                                                                                                                                                                                                                                                                              q       q
                                                                                                                                                                                                                                                                                                                                                                                                           q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               V(G)$size<-abs(res)*10
                                                                                                                                                                                                                                             q                                                   q




                q                   q                                                                                                                                                                                                                                                                                                                 q
                                                                                                                                                                                                                                                                                                                                                                       q                               q
                                                                                                                                                                                                                                                                                                                                                                                                                                                   q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               # Only display the labels of key
                                                                q




                                                                                                                                                                                                                                                                                                                                                           q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                   q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               # players
                                                                                                                                                                                                                                                                                                                                                                                                                                   q

                                                                                                                                                                                             q
                                                                                                                                                                                                                                                                 q




                                                                                                                                                                                                                                                                                     q               q
                                                                                                                                                                                                                                                                                                                                      q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               nodes<-as.vector(V(G)+1)
                            q

    q
                    q




                                                                                                                                q
                                                                                                                                                                                                                                                                                                                                                       q


                                                                                                                                                                                                                                                                                                                                                                                           q




                                                                                                                                                                                                                                                                                                                                                                                                                                                       q
                                                                                                                                                                                                                                                                                                                                                                                                                                                               q




                                                                                                                                                                                                                                                                                                                                                                                                                                                                               # Key players defined as have a
                                        q
                                                                            q
            q
                                                                                                                                                                 q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               # residual value >.25
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               nodes[which(abs(res)<.25)]<-NA
                                                                                                                                                                                                                                                                                                                                                                                                                       q
                                                                                                                                                                                     q

                                                                                            q
                                                                                                                                                                                                                 q
                                                                                                                                                                                                                     79
                                                                                                                                                                                                                                                                                                                                                                                                                                                           q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               # Save plot as PDF
                                                                                                                            q                                                                                                                                                                                                                                                                                                              q



                                                            q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               pdf(‘actor_plot.pdf’,pointsize=7)
                                                                                                    q
                                                                                                                                                                         q                               q
                                                                                                                                                                                                                                                                                                                                                                                                                                                               q               plot(G,layout=l,vertex.label=nodes,
                                                                    q                                                                                            q

                                                                                                                                        102                                                                                                                                                                                                                                                                                    q


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   vertex.label.dist=0.25,
                                                                                                                                q
                                            q               q                                                                                                                                                                q
                                                                                                                                                                                                                                                                                                                                                                                                   q                                           q
                                                                                                                                                                                                                                                                                                                                                                                                                                                       q
                                                                                                                                                                                                 q


                                                                            q
                                                                                                            q
                                                                                                                                                                         q

                                                                                                                                                                         q
                                                                                                                                                                                                                                                     q
                                                                                                                                                                                                                                                                                                                                                                       q



                                                                                                                                                                                                                                                                                                                                                                                                                                                                       q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   vertex.label.color=‘red’,edge.width=1)
                                                                                                                                                                     q
                                                                                                q
                                                                                                                                q
                                                                                                                                                                                                                                 q       q
                                                                                                                                                                                                                                                                                                     q                                    q
                                                                                                                                                                                                                                                                                                                                                                               q
                                                                                                                                                                                                                                                                                                                                                                                                                                           q
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               dev.off()
                                                                                                                                         q                                                   q
                                                                                                                                                                                                                                                                                                                                                                                                               q
                                                    q                           q                                       q
                                                                                                                                                             q                                                                                                                       q
                                                                                                                                                                                                                                                                                                                 q


                                                                                                                                                                                                                                                         q

                                                        q

                                                                                                                                                                                 q



                                                                                                                                                                                                                                             q
                                                                                                                                                                                                                                                                                             q                                                                             q
                                                                                                                                                                                                         q
                                                                                                                                q                        q

                                                                                                                                                                                                                                                                                                                                                               q
                                                                                                                                                                                                                                                                                                                                                                                                   q
                                                                                        q

                                                                                                                                                                                                                                                                                         q


                                                                                                                                                                                                                                                                                                                                                                       q




                                                                                                                            q                                                                                                        q                                                                               q




                                                                                                                                                         q


                                                                                                                                                                                                                                                         q

                                                                                                                                             q

                                                                                                                                                                             q




                                                                                                                                                                                                                                                                                                 Drew Conway                                                                                                   Mining and Analyzing Online Social Graph Data
Network basics
                                          Research design    Edge context
                                         Digging into data   Network data and ethics
                                       Live demonstration

Not all edges are created equally

   I have spent a lot of time tonight describing network analysis as a way to
   understand relationships
        Depending on the source and context of the data, these relationships can be interpreted as many different
        things.
        Must consider the data generation process by which the edge was created
        With respect to online social graph data, ask: how do people use this service?


                                                     Facebook                             Google SocialGraph
         Twitter
                                       Ties here are driven by                         Connecting all of Google’s
Recent study shows
                                       personal contact                                social sites together
Twitter is used as news
                                       me → offline → you                                me → anything → you
aggregation service
me → interests → you                           Offline relationships                         The combining of
                                               are driving                                 multiple platforms
    Geography and
                                               “friending”                                 into a single
    history less important
                                               Considerable amount                         “network” makes
    Networks may cluster                                                                   analysis and
                                               of meta-data already
    around communities                                                                     interpretation
                                               in FB, what does this
    of interest                                                                            difficult
                                               add?

                                            Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                                Research design    Edge context
                               Digging into data   Network data and ethics
                             Live demonstration

Ethics and network data
   Why should we be discusses ethics at a data analytics seminar?
 “People have really gotten comfortable
 not only sharing more information and             “Money is a terrible master but an
 different kinds, but more openly and               excellent servant”
 with more people. That social norm is
 just something that has evolved over              “There’s a sucker born every minute”
 time.”
                                                                                             P.T. Barnum
      Mark Zuckerburg, CEO, Facebook
   In isolation, the data we provide online about our relationships and preferences
   are fairly innocuous. Their summation, however, can illuminate aspects of
   our lives that can be exploited for any number of ways.
        Social networking services are moving previously private data to the public
        Simply because data is available does not mean that individuals want or
        realize that it can be used to make inferences about who they are or their
        lifestyles
        Analysts are caught in the middle. There is no IRB on the Internet!

                                  Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                               Research design    Edge context
                              Digging into data   Network data and ethics
                            Live demonstration

Examples of ethically questionable network analyses
   MIT study predicts sexual orientation from Facebook friends
        Two undergraduate students start a project ‘Gaydar’
        Claim they can predict which men are homosexual simply based on their
        friend structure
        Problem: Extremely private information, questionable methods
   Pleaserobme.com
        Combined information from fousquare.com and Twitter to publicize when
        people were clearly not at home
        Used as a proof of concept to show the danger of publicizing localization
        information
        Problem: Useful for raising awareness, hurtful to anyone who was actually
        robbed
   Project Grey Goose
        Large group of former and current DoD/IC analysts collaborated to study
        the identity of hackers
        I was personally involved in this project
        Using public web forum data, built up user profiles and social networks to
        attempt to identify hackers affiliated with Russian government
        Problem: While no names were published, bordering on vigilantism
                                 Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                                   Research design    Getting data
                                  Digging into data   First scrape and build
                                Live demonstration

Where to get social graph data

   Recently, there has been an explosion of resources for scraping social graph
           Service       Data                                            API Docs

                         Following(ers), @-replies, date/time/geo        http://apiwiki.twitter.com/

                         Friends, Wall Posts, date/time                  http://developers.facebook.com/docs/api

                         All SocialGraph relationships                   http://code.google.com/apis/socialgraph/

                         Friends, Check-ins                              http://foursquare.com/developers/
                         “Taste graph”, recommendations                  http://hunch.com/developers/

                         Congressional votes, campaign finance            http://developer.nytimes.com/docs

   There is clearly no shortage of data
       Each service provides different relational context
       Data formats are generally JSON, Atom, XML, or some combination
       For a more extensive list of API resources, see HackNY wiki of local
       startups


                                     Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                           Research design    Getting data
                          Digging into data   First scrape and build
                        Live demonstration

Building the social network among LiveJournal users




                                              Using a “seed” user, we will build out a
                                              network
                                                      In Python, use NetworkX, cjson
                                                      and a other standard scientific
                                                      libraries, parse the SocialGraph
                                                      data
                                                      Through a process called
                                                      “k-snowball searching”
                                                      seed → friend → · · · → friendk
                                                              Seed: imichaeldotorg.livejournal.com
                                                              k =3

                                                      Note the low value of k




                             Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                                              Research design    Getting data
                                             Digging into data   First scrape and build
                                           Live demonstration

The code, part 1

 Loading the libraries and setting things up
 from cjson import *
 from urllib import *
 from networkx import *
                                                                 Name:                    [‘http://imichaeldotorg.livejournal.com/’]
 from time import *
                                                                 Type:                    DiGraph
 from scipy import array,unique
                                                                 Number of nodes:         5
 ...
                                                                 Number of edges:         5
 if __name__ == "__main__":
                                                                 Average in degree:       1.0
     seed_url=‘‘http://imichaeldotorg.livejournal.com"
                                                                 Average out degree:      1.0
     sg=get_sg(seed_url)
     net,newnodes=create_egonet(sg)
     info(net)




    Get the JSON from SocialGraph
          def get_sg(seed_url):
            sgapi_url="http://socialgraph.apis.google.com/lookup?q="+seed_url+"&edo=1&edi=1&fme=1&pretty=0"
            try:
                furl=urlopen(sgapi_url)
                fr=furl.read()
                furl.close()
                return fr
            except IOError:
                print "Could not connect to website"
                print sgapi_url
                return




                                                Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                                             Research design      Getting data
                                            Digging into data     First scrape and build
                                          Live demonstration

Build egonet and snowball

 Creating the egonet                                   Rolling the snowball
 def create_egonet(s):                                 def snowball_round(G,seeds,myspace=False):
     try:                                                  t0=time()
         raw=decode(s)                                     if myspace:
         G=DiGraph()                                           seeds=get_myspace_url(seeds)
         pendants=[]                                       sb_data=[]
         n=raw[’nodes’]                                    for s in range(0,len(seeds)):
         nk=n.keys()                                           s_sg=get_sg(seeds[s])
         G.name=str(nk)                                        new_ego,pen=create_egonet(s_sg)
         pendants=[]                                           for p in pen:
         for a in range(0,len(nk)):                                    sb_data.append(p)
             for b in range(0,len(nk)):                        if s<1:
                 if a!=b:                                          sb_net=compose(G,new_ego)
                      G.add_edge(nk[a],nk[b])                  else:
         for k in nk:                                              sb_net=compose(new_ego,sb_net)
             ego=n[k]                                          del new_ego
             ego_out=ego[’nodes_referenced’]                   if s==round(len(seeds)*0.2):
             for o in ego_out:                                     sb_net.name=’20% complete’
                 G.add_edge(k,o)                                   sb_net.info()
                 pendants.append(o)                                print ’AT: ’+strftime(’%m/%d/%Y, %H:%M:%S’, gmtime())
             ego_in=ego[’nodes_referenced_by’]                     print ’’
             for i in ego_in:                              ...
                 G.add_edge(i,k)                           # More time keeping, probably a MUCH better way to do this
                 pendants.append(i)                        sb_data=array(sb_data)
         pendants=array(pendants,dtype=str)                sb_data.flatten()
         pendants.flatten()                                sb_data=unique(sb_data)
         pendants=unique(pendants)                         sb_net.info()
         return G,pendants                                 return sb_net,sb_data
     except DecodeError:
     ...
     except KeyError:



                                                 Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                                  Research design       Getting data
                                 Digging into data      First scrape and build
                               Live demonstration

Build the whole network
  Step   Nodes   Edges   Mean Degree       Density              Our seed is abnormally isolated, with only four
  Seed   5       5       2.0               0.25                 neighbors
  k =2   75      115     3.0               0.02                 Large jump after first snowball
  k =3   4,938   8,659   3.5               3.6(10−4 )
                                                                Massive structural leap at k = 3




                                    Drew Conway         Mining and Analyzing Online Social Graph Data
Network basics
                                 Research design    Getting data
                                Digging into data   First scrape and build
                              Live demonstration

The full network
   To get a feeling for the size of the full network...




                                   Drew Conway      Mining and Analyzing Online Social Graph Data
Network basics
                   Research design
                  Digging into data
                Live demonstration

Live demo




            Live demonstration time!




                     Drew Conway      Mining and Analyzing Online Social Graph Data

More Related Content

Viewers also liked

Social networking sites selling information to third parties
Social networking sites selling information to third partiesSocial networking sites selling information to third parties
Social networking sites selling information to third partiesjulia594
 
How to Leverage the Social Graph with Facebook Platform
How to Leverage the Social Graph with Facebook PlatformHow to Leverage the Social Graph with Facebook Platform
How to Leverage the Social Graph with Facebook PlatformDave Olsen
 
Mining the social graph
Mining the social graphMining the social graph
Mining the social graphshunya kimura
 
Socialite, the Open Source Status Feed Part 2: Managing the Social Graph
Socialite, the Open Source Status Feed Part 2: Managing the Social GraphSocialite, the Open Source Status Feed Part 2: Managing the Social Graph
Socialite, the Open Source Status Feed Part 2: Managing the Social GraphMongoDB
 
Network measures used in social network analysis
Network measures used in social network analysis Network measures used in social network analysis
Network measures used in social network analysis Dragan Gasevic
 
Introducing Neo4j
Introducing Neo4jIntroducing Neo4j
Introducing Neo4jNeo4j
 
Antibioticoterapia resumen
Antibioticoterapia resumenAntibioticoterapia resumen
Antibioticoterapia resumenivan89aem
 
7장 네트워크로 세상을 읽다 : 사회 관계망 분석 입문하기
7장 네트워크로 세상을 읽다 : 사회 관계망 분석 입문하기7장 네트워크로 세상을 읽다 : 사회 관계망 분석 입문하기
7장 네트워크로 세상을 읽다 : 사회 관계망 분석 입문하기Hyochan PARK
 
Cyberbullying powerpoint
Cyberbullying powerpointCyberbullying powerpoint
Cyberbullying powerpointjosiebrookeday
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark StreamingP. Taylor Goetz
 
Top Rumors About Apple March 21 Big Event
Top Rumors About Apple March 21 Big EventTop Rumors About Apple March 21 Big Event
Top Rumors About Apple March 21 Big EventChromeInfo Technologies
 
Mobile-First SEO - The Marketers Edition #3XEDigital
Mobile-First SEO - The Marketers Edition #3XEDigitalMobile-First SEO - The Marketers Edition #3XEDigital
Mobile-First SEO - The Marketers Edition #3XEDigitalAleyda Solís
 

Viewers also liked (12)

Social networking sites selling information to third parties
Social networking sites selling information to third partiesSocial networking sites selling information to third parties
Social networking sites selling information to third parties
 
How to Leverage the Social Graph with Facebook Platform
How to Leverage the Social Graph with Facebook PlatformHow to Leverage the Social Graph with Facebook Platform
How to Leverage the Social Graph with Facebook Platform
 
Mining the social graph
Mining the social graphMining the social graph
Mining the social graph
 
Socialite, the Open Source Status Feed Part 2: Managing the Social Graph
Socialite, the Open Source Status Feed Part 2: Managing the Social GraphSocialite, the Open Source Status Feed Part 2: Managing the Social Graph
Socialite, the Open Source Status Feed Part 2: Managing the Social Graph
 
Network measures used in social network analysis
Network measures used in social network analysis Network measures used in social network analysis
Network measures used in social network analysis
 
Introducing Neo4j
Introducing Neo4jIntroducing Neo4j
Introducing Neo4j
 
Antibioticoterapia resumen
Antibioticoterapia resumenAntibioticoterapia resumen
Antibioticoterapia resumen
 
7장 네트워크로 세상을 읽다 : 사회 관계망 분석 입문하기
7장 네트워크로 세상을 읽다 : 사회 관계망 분석 입문하기7장 네트워크로 세상을 읽다 : 사회 관계망 분석 입문하기
7장 네트워크로 세상을 읽다 : 사회 관계망 분석 입문하기
 
Cyberbullying powerpoint
Cyberbullying powerpointCyberbullying powerpoint
Cyberbullying powerpoint
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Top Rumors About Apple March 21 Big Event
Top Rumors About Apple March 21 Big EventTop Rumors About Apple March 21 Big Event
Top Rumors About Apple March 21 Big Event
 
Mobile-First SEO - The Marketers Edition #3XEDigital
Mobile-First SEO - The Marketers Edition #3XEDigitalMobile-First SEO - The Marketers Edition #3XEDigital
Mobile-First SEO - The Marketers Edition #3XEDigital
 

More from NYC Predictive Analytics

Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsNYC Predictive Analytics
 
The caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsThe caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsNYC Predictive Analytics
 
Intro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMIntro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMNYC Predictive Analytics
 
Introduction to R Package Recommendation System Competition
Introduction to R Package Recommendation System CompetitionIntroduction to R Package Recommendation System Competition
Introduction to R Package Recommendation System CompetitionNYC Predictive Analytics
 
Optimization: A Framework for Predictive Analytics
Optimization: A Framework for Predictive AnalyticsOptimization: A Framework for Predictive Analytics
Optimization: A Framework for Predictive AnalyticsNYC Predictive Analytics
 
An Introduction to Multilevel Regression Modeling for Prediction
An Introduction to Multilevel Regression Modeling for PredictionAn Introduction to Multilevel Regression Modeling for Prediction
An Introduction to Multilevel Regression Modeling for PredictionNYC Predictive Analytics
 
How OMGPOP Uses Predictive Analytics to Drive Change
How OMGPOP Uses Predictive Analytics to Drive ChangeHow OMGPOP Uses Predictive Analytics to Drive Change
How OMGPOP Uses Predictive Analytics to Drive ChangeNYC Predictive Analytics
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisNYC Predictive Analytics
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineNYC Predictive Analytics
 

More from NYC Predictive Analytics (11)

Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media Analytics
 
The caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsThe caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive Models
 
Intro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMIntro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVM
 
Introduction to R Package Recommendation System Competition
Introduction to R Package Recommendation System CompetitionIntroduction to R Package Recommendation System Competition
Introduction to R Package Recommendation System Competition
 
R package Recommendation Engine
R package Recommendation EngineR package Recommendation Engine
R package Recommendation Engine
 
Optimization: A Framework for Predictive Analytics
Optimization: A Framework for Predictive AnalyticsOptimization: A Framework for Predictive Analytics
Optimization: A Framework for Predictive Analytics
 
An Introduction to Multilevel Regression Modeling for Prediction
An Introduction to Multilevel Regression Modeling for PredictionAn Introduction to Multilevel Regression Modeling for Prediction
An Introduction to Multilevel Regression Modeling for Prediction
 
How OMGPOP Uses Predictive Analytics to Drive Change
How OMGPOP Uses Predictive Analytics to Drive ChangeHow OMGPOP Uses Predictive Analytics to Drive Change
How OMGPOP Uses Predictive Analytics to Drive Change
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
 
Recommendation Engine Demystified
Recommendation Engine DemystifiedRecommendation Engine Demystified
Recommendation Engine Demystified
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engine
 

Recently uploaded

Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Recently uploaded (20)

Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Mining and Analyzing Online Social Graph Data

  • 1. Mining and Analyzing Online Social Graph Data Drew Conway New York University, Dept. of Politics May 13, 2010
  • 2. Agenda Network basics Unit of analysis Data representation Analysis & Visualization Thoughts on research design Edge contexts A bit on social graph data and ethics Digging into the data Where to get it Our first scrape and build! Real-time social graph analysis Build the network of Twitter users in the room Hold our breath...
  • 3. Live demo setup Last week Bruno asked how many of you use Twitter 15 (54%) members said yes, which may be enough to do some interesting stuff Please tweet the following hash-tag: #analyticsnyc Some example tweets: Many thanks to the @GiltGroupe for hosting tonight’s #analyticsnyc meetup Hanging out with @DKALab and @brunocm at #analyticsnyc I cannot wait to watch @drewconway crash and burn with his #analyticsnyc live demo
  • 4. Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Networks and the study of relationships Network theory use the language of graph theory G = {V , E } ↑ In the abstract this is very powerful, we have a general purpose way to represent any number of relations G={Routers,Packet Traffic} Edges have non-binary values G={U.S. Airports, Commercial Routes} Edges can represent distance, cost, frequency, etc. G={New York City nerds, Co-membership in Meetups} Edges are implied! While both nodes and edges are needed to have a network of any substance, it is the edge (relationship) that will always be the primary focus of our analyses Drew Conway Mining and Analyzing Online Social Graph Data
  • 5. Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Why focus on the edge? Consider a very simple example of two people meeting for the first time... In reality, the meeting may reveal a large If we focus on the nodes, we might degree of shared structure: consider this meeting creates a dyad: We know, however, that people do not exist as isolates and this assumption is ignoring all of By expressing this meeting in terms of the exogenous social structure these nodes as a function of their edges we may individuals brings with them. gain a much richer understanding of the structural dynamics Drew Conway Mining and Analyzing Online Social Graph Data
  • 6. Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Representing network data: the canonical Perhaps the most natural way to represent the relationships between N actors is with an NxN matrix, often referred to as a “sociomatrix” X1 X2 ... XN X1 0 1 ... 0 Very intuitive representation X2 1 0 ... 1 Can support directional and weighted edges . . . . . . .. . . . . . . . Application of matrix algebraic operation for XN 0 1 ... 0 analysis As an introduction to representing relationship, a matrix provides insight to a network by itself. In practice, however, this representation has many limitations Unwieldy as network sizes scale up Most networks are sparse; too many zeroes Difficult to publish and share Fortunately, there are many other options for representing network data! Drew Conway Mining and Analyzing Online Social Graph Data
  • 7. Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Representing network data: the practical Remember, it is all about the edges; therefore, more efficient representations will be limited to edge data Edge list Adjacency list A text file with two columns (usually Also a text file, however, here the first delimited by a space or tab), where first column is the source, and all column in source, and second is target subsequent entries are “adjacent” nodes 1 2 1 3 2 1 6 7 6 8 1 2 3 10 12 2 1 10 13 6 7 8 10 12 13 While these are the most universal data formats, network data representation is a bit of a cottage industry Pajek (.net) GraphML (.gml) GraphViz (.dot) ..and domain specific formats (eek!) Drew Conway Mining and Analyzing Online Social Graph Data
  • 8. Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration A bit on tools The number of software suites and packages available for conducting social network analysis has exploded over the past ten years In general, this software can be categorized in two ways: Type - many SNA tools are developed to be standalone applications, while others are language specific packages Intent - consumers and producer of SNA come from a wide range of technical expertise and/or need, therefore, there exist simple tools for data collection and basic analysis, as well as complex suites for advanced research Standalone Apps Modules & Packages - ORA (Windows) - libSNA (Python) Basic - Analyst Notebook (Windows) - UrlNet (Python) - KrakPlot (Windows) - NodeXL (MS Excel) - UCINet (Windows) - NetworkX (Python) Advanced - Pajek (Multi) - JUNG (Java) - Network Workbench (Multi) - igraph (Python, R & Ruby) Many of the above tools have visualization components, but several tools are designed specifically for visualization: Graphviz, NetDraw, Tom Sawyer, Gephi, etc. What I use Drew Conway Mining and Analyzing Online Social Graph Data
  • 9. Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Comparing two network metrics to find key actors Often network analysis is used to identify key actors within a social group. To identify these actors, various centrality metrics can be computed based on a network’s structure Degree (number of connections) Betweenness (number of shortest paths an actor is on) Closeness (relative distance to all other actors) Eigenvector centrality (leading eigenvector of sociomatrix) One method for using these metrics to identify key actors is to plot actors’ scores for Eigenvector centrality versus Betweenness. Theoretically, these metrics should be approximately linear; therefore, any non-linear outliers will be of note. An actor with very high betweenness but low EC may be a critical gatekeeper to a central actor Likewise, an actor with low betweenness but high EC may have unique access to central actors Drew Conway Mining and Analyzing Online Social Graph Data
  • 10. Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration First, visualize the data Visualization can be the best first step in the analytical process This will give you a good feel for what is going on with your relationships. For large networks this is often not possible For this example, we will use the main component of the social network collected on drug users in Hartford, CT. The network has 194 nodes and 273 edges. Here I am using the GUESS visualizer in NWB with a Kamada-Kawai layout Drew Conway Mining and Analyzing Online Social Graph Data
  • 11. Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Finding Key Actors with R The first steps are to load the data into memory, and perform some basic centrality analysis1 Load the data into igraph library(igraph) G<-read.graph("drug_main.txt",format="edgelist") G<-as.undirected(G) # By default, igraph inputs edgelist data as a directed graph. # In this step, we undo this and assume that all relationships are reciprocal. Store metrics in new data frame cent<-data.frame(bet=betweenness(G),eig=evcent(G)$vector) # evcent returns lots of data associated with the EC, but we only need the # leading eigenvector res<-lm(eig~bet,data=cent)$residuals cent<-transform(cent,res=res) # We will use the residuals in the next step 1 Weeks, et al (2002) http://dx.doi.org/10.1023/A:1015457400897 Drew Conway Mining and Analyzing Online Social Graph Data
  • 12. Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Finding Key Actors with R Key Actor Analysis for Hartford Drug Users 1.0 44 Plot the data res library(ggplot2) 0.8 −0.2 # We use ggplot2 to make things a 0 # bit prettier 0.2 p<-ggplot(cent,aes(x=bet,y=eig, 28 53 0.4 label=rownames(cent),colour=res, 0.6 0.6 size=abs(res)))+xlab("Betweenness Centrality Eigenvector Centrality")+ylab("Eigenvector 50 abs(res) Centrality") 0.1 # We use the residuals to color and 0.4 0.2 # shape the points of our plot, 141 47 # making it easier to spot outliers. 20 56 0.3 155 58 43 p+geom_text()+opts(title="Key Actor 100 22 126 0.4 57 147 Analysis for Hartford Drug Users") 23 140 18 0.5 104 19 # We use the geom_text function to plot 0.2 75 65 84 139 41 9 0.6 13685 # the actors’ ID’s rather than points 48 17 87 55 54 63 89105 94 77 184 67 0.7 # so we know who is who 101 64 165 93 3349 162 86 91 177 170 21 36168 174 52 111 27 16 42 68 6 12 69 148 169 149 78 183 181 110 172 25 185 182 150 61 11 176 130 45 79 138 66 62 0.0 124 26 164 121 163 189 179 152 119 127 38 88 151 133 161 166 107276114399 46 5 97 608 118 4 120 240 137 109 107159 157144 73 116 51 95 90 135 186 190 123 187 156 173 192 146 96 122 82 125 143 134 37132142 72 13 112 71 167 31 153 145 128 129 188 193 191 15 171 92 39 13 180 158 175 178 106 108 194 113 98 81 59 83 80 74103 154 24 117 70 160 115 34 131 14 30 29 35 102 0 1000 2000 3000 4000 5000 6000 Betweenness Centrality Drew Conway Mining and Analyzing Online Social Graph Data
  • 13. Network basics The study of relationships Research design Data representation Digging into data Analysis & Visualization Live demonstration Key Actor Plot q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 141 q q q q q 155 Network plot q q q q q q q 58 47 44 q q q # Create positions for all of q 28 q q q q q q q q q q q q q q # the nodes w/ force directed l<-layout.fruchterman.reingold(G, 53 q q q q q q q q q q q niter=500) 20 50 q q q q q # Set the nodes’ size relative to q q q q q q q q q q q q # their residual value q q q q q q q V(G)$size<-abs(res)*10 q q q q q q q q # Only display the labels of key q q q # players q q q q q q nodes<-as.vector(V(G)+1) q q q q q q q q # Key players defined as have a q q q q # residual value >.25 nodes[which(abs(res)<.25)]<-NA q q q q 79 q # Save plot as PDF q q q pdf(‘actor_plot.pdf’,pointsize=7) q q q q plot(G,layout=l,vertex.label=nodes, q q 102 q vertex.label.dist=0.25, q q q q q q q q q q q q q q q vertex.label.color=‘red’,edge.width=1) q q q q q q q q q dev.off() q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Drew Conway Mining and Analyzing Online Social Graph Data
  • 14. Network basics Research design Edge context Digging into data Network data and ethics Live demonstration Not all edges are created equally I have spent a lot of time tonight describing network analysis as a way to understand relationships Depending on the source and context of the data, these relationships can be interpreted as many different things. Must consider the data generation process by which the edge was created With respect to online social graph data, ask: how do people use this service? Facebook Google SocialGraph Twitter Ties here are driven by Connecting all of Google’s Recent study shows personal contact social sites together Twitter is used as news me → offline → you me → anything → you aggregation service me → interests → you Offline relationships The combining of are driving multiple platforms Geography and “friending” into a single history less important Considerable amount “network” makes Networks may cluster analysis and of meta-data already around communities interpretation in FB, what does this of interest difficult add? Drew Conway Mining and Analyzing Online Social Graph Data
  • 15. Network basics Research design Edge context Digging into data Network data and ethics Live demonstration Ethics and network data Why should we be discusses ethics at a data analytics seminar? “People have really gotten comfortable not only sharing more information and “Money is a terrible master but an different kinds, but more openly and excellent servant” with more people. That social norm is just something that has evolved over “There’s a sucker born every minute” time.” P.T. Barnum Mark Zuckerburg, CEO, Facebook In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited for any number of ways. Social networking services are moving previously private data to the public Simply because data is available does not mean that individuals want or realize that it can be used to make inferences about who they are or their lifestyles Analysts are caught in the middle. There is no IRB on the Internet! Drew Conway Mining and Analyzing Online Social Graph Data
  • 16. Network basics Research design Edge context Digging into data Network data and ethics Live demonstration Examples of ethically questionable network analyses MIT study predicts sexual orientation from Facebook friends Two undergraduate students start a project ‘Gaydar’ Claim they can predict which men are homosexual simply based on their friend structure Problem: Extremely private information, questionable methods Pleaserobme.com Combined information from fousquare.com and Twitter to publicize when people were clearly not at home Used as a proof of concept to show the danger of publicizing localization information Problem: Useful for raising awareness, hurtful to anyone who was actually robbed Project Grey Goose Large group of former and current DoD/IC analysts collaborated to study the identity of hackers I was personally involved in this project Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with Russian government Problem: While no names were published, bordering on vigilantism Drew Conway Mining and Analyzing Online Social Graph Data
  • 17. Network basics Research design Getting data Digging into data First scrape and build Live demonstration Where to get social graph data Recently, there has been an explosion of resources for scraping social graph Service Data API Docs Following(ers), @-replies, date/time/geo http://apiwiki.twitter.com/ Friends, Wall Posts, date/time http://developers.facebook.com/docs/api All SocialGraph relationships http://code.google.com/apis/socialgraph/ Friends, Check-ins http://foursquare.com/developers/ “Taste graph”, recommendations http://hunch.com/developers/ Congressional votes, campaign finance http://developer.nytimes.com/docs There is clearly no shortage of data Each service provides different relational context Data formats are generally JSON, Atom, XML, or some combination For a more extensive list of API resources, see HackNY wiki of local startups Drew Conway Mining and Analyzing Online Social Graph Data
  • 18. Network basics Research design Getting data Digging into data First scrape and build Live demonstration Building the social network among LiveJournal users Using a “seed” user, we will build out a network In Python, use NetworkX, cjson and a other standard scientific libraries, parse the SocialGraph data Through a process called “k-snowball searching” seed → friend → · · · → friendk Seed: imichaeldotorg.livejournal.com k =3 Note the low value of k Drew Conway Mining and Analyzing Online Social Graph Data
  • 19. Network basics Research design Getting data Digging into data First scrape and build Live demonstration The code, part 1 Loading the libraries and setting things up from cjson import * from urllib import * from networkx import * Name: [‘http://imichaeldotorg.livejournal.com/’] from time import * Type: DiGraph from scipy import array,unique Number of nodes: 5 ... Number of edges: 5 if __name__ == "__main__": Average in degree: 1.0 seed_url=‘‘http://imichaeldotorg.livejournal.com" Average out degree: 1.0 sg=get_sg(seed_url) net,newnodes=create_egonet(sg) info(net) Get the JSON from SocialGraph def get_sg(seed_url): sgapi_url="http://socialgraph.apis.google.com/lookup?q="+seed_url+"&edo=1&edi=1&fme=1&pretty=0" try: furl=urlopen(sgapi_url) fr=furl.read() furl.close() return fr except IOError: print "Could not connect to website" print sgapi_url return Drew Conway Mining and Analyzing Online Social Graph Data
  • 20. Network basics Research design Getting data Digging into data First scrape and build Live demonstration Build egonet and snowball Creating the egonet Rolling the snowball def create_egonet(s): def snowball_round(G,seeds,myspace=False): try: t0=time() raw=decode(s) if myspace: G=DiGraph() seeds=get_myspace_url(seeds) pendants=[] sb_data=[] n=raw[’nodes’] for s in range(0,len(seeds)): nk=n.keys() s_sg=get_sg(seeds[s]) G.name=str(nk) new_ego,pen=create_egonet(s_sg) pendants=[] for p in pen: for a in range(0,len(nk)): sb_data.append(p) for b in range(0,len(nk)): if s<1: if a!=b: sb_net=compose(G,new_ego) G.add_edge(nk[a],nk[b]) else: for k in nk: sb_net=compose(new_ego,sb_net) ego=n[k] del new_ego ego_out=ego[’nodes_referenced’] if s==round(len(seeds)*0.2): for o in ego_out: sb_net.name=’20% complete’ G.add_edge(k,o) sb_net.info() pendants.append(o) print ’AT: ’+strftime(’%m/%d/%Y, %H:%M:%S’, gmtime()) ego_in=ego[’nodes_referenced_by’] print ’’ for i in ego_in: ... G.add_edge(i,k) # More time keeping, probably a MUCH better way to do this pendants.append(i) sb_data=array(sb_data) pendants=array(pendants,dtype=str) sb_data.flatten() pendants.flatten() sb_data=unique(sb_data) pendants=unique(pendants) sb_net.info() return G,pendants return sb_net,sb_data except DecodeError: ... except KeyError: Drew Conway Mining and Analyzing Online Social Graph Data
  • 21. Network basics Research design Getting data Digging into data First scrape and build Live demonstration Build the whole network Step Nodes Edges Mean Degree Density Our seed is abnormally isolated, with only four Seed 5 5 2.0 0.25 neighbors k =2 75 115 3.0 0.02 Large jump after first snowball k =3 4,938 8,659 3.5 3.6(10−4 ) Massive structural leap at k = 3 Drew Conway Mining and Analyzing Online Social Graph Data
  • 22. Network basics Research design Getting data Digging into data First scrape and build Live demonstration The full network To get a feeling for the size of the full network... Drew Conway Mining and Analyzing Online Social Graph Data
  • 23. Network basics Research design Digging into data Live demonstration Live demo Live demonstration time! Drew Conway Mining and Analyzing Online Social Graph Data