2. Networks
A flexible and general data structure.
Many types of data can be formulated as
networks.
Universal language for describing complex
data.
Networks from science, nature,
and technology are more similar
than one would expect.
3. Classical ML tasks in
networks
Node classification - Predict the type of a given node
Link prediction - Predict whether two nodes are linked
Community detection - Identify densely linked clusters of nodes
Network similarity - How similar are two (sub)networks?
4. Graph Embedding
Graph Embedding is an approach used to transform nodes, edges, and their features into a lower-dimensional vector space while maximally preserving properties such as graph structure and node information.
5. DeepWalk
DeepWalk uses local information obtained from
truncated random walks to learn latent
representations by treating walks as the
equivalent of sentences.
DeepWalk learns social representations of a graph’s vertices by modeling a stream of short random walks.
Social representations are latent features of the
vertices that capture neighborhood similarity and
community membership.
DeepWalk takes a graph as input and produces a latent representation as output. The result of applying our method to the well-studied Karate network is shown in the accompanying figure.
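As a rough sketch of the walk-generation step (not the authors' reference implementation), uniform truncated random walks can be simulated over a networkx graph; the function name and hyperparameter values below are illustrative.

```python
import random
import networkx as nx

def generate_walks(G, num_walks=10, walk_length=40, seed=0):
    """Simulate truncated uniform random walks starting from every node, DeepWalk-style."""
    rng = random.Random(seed)
    walks = []
    nodes = list(G.nodes())
    for _ in range(num_walks):
        rng.shuffle(nodes)                        # one pass over all start nodes, in random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(G.neighbors(walk[-1]))
                if not neighbors:                 # dead end (isolated node)
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(v) for v in walk])  # stringified node ids, word2vec-style "sentences"
    return walks

# Example on the Karate club graph mentioned above
walks = generate_walks(nx.karate_club_graph())
```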
7. Connection: Power laws
The frequency with which vertices appear in the short
random walks will also follow a power-law
distribution. Word frequency in natural language
follows a similar distribution, and techniques
from language modeling account for this
distributional behavior.
9. SkipGram
SkipGram is an algorithm used to create word embeddings, i.e., dense vector representations of words. These embeddings are meant to encode the semantic meaning of words such that semantically similar words lie close to each other in that vector space.
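As a sketch, the walks generated above can be fed to an off-the-shelf SkipGram implementation such as gensim's Word2Vec; the hyperparameter values shown are illustrative, not values prescribed by the papers.

```python
from gensim.models import Word2Vec

# Treat each random walk as a "sentence" and each node id as a "word".
model = Word2Vec(
    sentences=walks,   # list of lists of node ids (as strings), e.g. from generate_walks above
    vector_size=128,   # embedding dimension d
    window=5,          # context window over the walk
    sg=1,              # SkipGram architecture
    hs=1,              # hierarchical softmax (see the next section)
    min_count=0,
    workers=4,
)
embedding_of_node_0 = model.wv["0"]
```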
10. Hierarchical Softmax
If we assign the vertices to the leaves of a binary
tree, the prediction problem turns into
maximizing the probability of a specific path in
the tree. If the path to vertex $u_k$ is identified by a sequence of tree nodes $(b_0, b_1, \ldots, b_{\lceil \log |V| \rceil})$, with $b_0$ = root and $b_{\lceil \log |V| \rceil} = u_k$, then

$$\Pr(u_k \mid \Phi(v_j)) = \prod_{l=1}^{\lceil \log |V| \rceil} \Pr(b_l \mid \Phi(v_j))$$

Now, $\Pr(b_l \mid \Phi(v_j))$ can be modeled by a binary classifier assigned to the parent of the node $b_l$. This reduces the computational complexity of calculating $\Pr(u_k \mid \Phi(v_j))$ from $O(|V|)$ to $O(\log |V|)$.
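A toy sketch of this idea, assuming each internal tree node carries its own logistic classifier; the names internal_nodes, directions, and psi are invented for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_probability(internal_nodes, directions, phi_vj, psi):
    """Toy Pr(u_k | Phi(v_j)) as a product of binary decisions along the root-to-leaf path.

    internal_nodes : tree nodes visited from the root down to the parent of leaf u_k
    directions     : +1 / -1, whether the path turns right or left at each internal node
    phi_vj         : embedding Phi(v_j) of the context vertex
    psi            : dict mapping each internal node to its classifier's weight vector
    """
    prob = 1.0
    for b, d in zip(internal_nodes, directions):
        prob *= sigmoid(d * np.dot(psi[b], phi_vj))   # one logistic classifier per internal node
    return prob   # only O(log |V|) factors instead of a |V|-way softmax
```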
11. Node2vec
Node2vec is an algorithmic framework for learning
continuous feature representations for nodes in
networks. In node2vec, we learn a mapping of
nodes to a low-dimensional space of features that
maximizes the likelihood of preserving network
neighborhoods of nodes. We define a flexible
notion of a node’s network neighborhood and
design a biased random walk procedure, which
efficiently explores diverse neighborhoods. Our
algorithm generalizes prior work which is based
on rigid notions of network neighborhoods, and
we argue that the added flexibility in exploring
neighborhoods is the key to learning richer
representations.
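For orientation, a typical usage sketch with the third-party node2vec Python package (this reflects an assumption about that package's API, installed via pip install node2vec, and is not code from the paper).

```python
import networkx as nx
from node2vec import Node2Vec

G = nx.karate_club_graph()

# Precompute biased random walks; p and q control the 2nd-order bias discussed below.
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, p=1, q=0.5, workers=2)

# Fit a SkipGram model on the walks (keyword arguments are forwarded to gensim's Word2Vec).
model = node2vec.fit(window=10, min_count=1)
vector = model.wv["0"]   # embedding of node 0 (node ids are stringified in the walks)
```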
12. BFS
The neighborhood is restricted to nodes which are immediate neighbors of the source.
For a neighborhood of size k=3, BFS samples nodes s1, s2, s3.
BFS <-> Structural Equivalence
Nodes that have similar structural roles in networks should be embedded closely together, e.g., nodes u and s6 in the figure.
By restricting the search to nearby nodes, BFS gives a microscopic view of the neighborhood.
Network roles such as bridges and hubs can be inferred using BFS.
13. DFS
The neighborhood consists of nodes sequentially sampled at increasing distances from the source node.
For a neighborhood of size k=3, DFS samples nodes s4, s5, s6.
DFS <-> Homophily
Nodes that are highly interconnected and belong to similar network communities should be embedded closely together, e.g., nodes u and s1 in the figure.
DFS-sampled nodes reflect a macro-view of the node’s neighborhood.
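A rough sketch contrasting the two extreme sampling strategies on an unweighted networkx graph; the function names are illustrative, not from the paper.

```python
import networkx as nx

def bfs_neighborhood(G, source, k):
    """Take the first k nodes discovered in breadth-first order (micro view, structural roles)."""
    sampled = []
    for _, node in nx.bfs_edges(G, source):
        sampled.append(node)
        if len(sampled) == k:
            break
    return sampled

def dfs_neighborhood(G, source, k):
    """Take the first k nodes discovered in depth-first order (macro view, homophily)."""
    sampled = []
    for _, node in nx.dfs_edges(G, source):
        sampled.append(node)
        if len(sampled) == k:
            break
    return sampled

G = nx.karate_club_graph()
print(bfs_neighborhood(G, 0, 3), dfs_neighborhood(G, 0, 3))
```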
14. Drawbacks of DeepWalk
DeepWalk learns d-dimensional feature representations by simulating uniform random walks. It can be viewed as a special case of node2vec with parameters p = 1 and q = 1.
Networks represent a mixture of homophily and structural equivalence, which is not effectively captured by the above method.
15. Feature Learning Framework
Let G = (V, E) be a given network; the framework applies to any (un)directed, (un)weighted network. Let $f : V \to \mathbb{R}^d$ be the mapping function from nodes to feature representations, where d is a parameter specifying the number of dimensions of our feature representation. Equivalently, f is a matrix of size $|V| \times d$ parameters. For every source node $u \in V$, we define $N_S(u) \subset V$ as a network neighborhood of node u generated through a neighborhood sampling strategy S. The objective is to maximize the log-probability of observing the network neighborhood $N_S(u)$ of each node given its feature representation:

$$\max_f \sum_{u \in V} \log \Pr(N_S(u) \mid f(u)) \qquad (1)$$

Conditional independence. We factorize the likelihood by assuming that the likelihood of observing a neighborhood node is independent of observing any other neighborhood node given the feature representation of the source:

$$\Pr(N_S(u) \mid f(u)) = \prod_{n_i \in N_S(u)} \Pr(n_i \mid f(u))$$

We model the conditional likelihood of every source-neighborhood node pair as a softmax unit parametrized by a dot product of their features:

$$\Pr(n_i \mid f(u)) = \frac{\exp(f(n_i) \cdot f(u))}{\sum_{v \in V} \exp(f(v) \cdot f(u))}$$

16. With the above assumptions, the objective in Eq. 1 simplifies to:

$$\max_f \sum_{u \in V} \Big[ -\log Z_u + \sum_{n_i \in N_S(u)} f(n_i) \cdot f(u) \Big]$$

where $Z_u = \sum_{v \in V} \exp(f(v) \cdot f(u))$ is the per-node partition function.
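A small numpy sketch of this simplified objective (names are illustrative: f is the |V| × d embedding matrix and neighborhoods maps each source node to its sampled N_S(u)); in practice Z_u is expensive for large networks and is approximated, e.g. with negative sampling, rather than computed exactly.

```python
import numpy as np

def objective(f, neighborhoods):
    """sum_u [ -log Z_u + sum_{n_i in N_S(u)} f(n_i) . f(u) ], with the exact partition function."""
    total = 0.0
    for u, neighbors in neighborhoods.items():
        scores = f @ f[u]                          # f(v) . f(u) for every node v in V
        log_Zu = np.log(np.exp(scores).sum())      # per-node partition function Z_u
        total += -log_Zu + sum(f[n] @ f[u] for n in neighbors)
    return total

# Usage sketch: f = np.random.randn(num_nodes, d); neighborhoods = {u: [n1, n2, ...], ...}
```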
Given the linear nature of text, the notion of a
neighborhood can be naturally defined using a
sliding window over consecutive words.
Networks, however, are not linear, and thus a
richer notion of a neighborhood is needed. To
resolve this issue, we propose a randomized
procedure that samples many different
neighborhoods of a given source node u. The
neighborhoods NS(u) are not restricted to just
immediate neighbors but can have vastly different
structures depending on the sampling strategy S.
Search bias α
We define a 2nd order random walk with two
parameters p and q which guide the walk:
Consider a random walk that just traversed edge
(t, v) and now resides at node v (Figure 2). The
walk now needs to decide on the next step so it
evaluates the transition probabilities πvx on edges
(v, x) leading from v. We set the unnormalized
transition probability to
17. $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$,

where $w_{vx}$ is the weight of the edge between nodes v and x and the search bias is

$$\alpha_{pq}(t, x) = \begin{cases} 1/p & \text{if } d_{tx} = 0 \\ 1 & \text{if } d_{tx} = 1 \\ 1/q & \text{if } d_{tx} = 2 \end{cases}$$

with $d_{tx}$ the shortest-path distance between nodes t and x. Note that $d_{tx}$ must be one of {0, 1, 2}, and hence the two parameters are necessary and sufficient to guide the walk.
Intuitively, parameters p and q control how fast
the walk explores and leaves the neighborhood of
starting node u. In particular, the parameters
allow our search procedure to (approximately)
interpolate between BFS and DFS and thereby
reflect an affinity for different notions of node
equivalences.
18. Formally, given a source node u, we simulate a
random walk of fixed length l. Let ci denote the
ith node in the walk, starting with c0 = u. Nodes
ci are generated by the following distribution:
$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases}$$

where $\pi_{vx}$ is the unnormalized transition probability between nodes v and x, and Z is the normalizing constant.
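A compact sketch of this 2nd-order walk for an unweighted, undirected networkx graph (so every w_vx is taken as 1); the function names are illustrative.

```python
import random
import networkx as nx

def alpha(p, q, t, x, G):
    """Search bias alpha_pq(t, x), determined by the distance d_tx in {0, 1, 2}."""
    if x == t:                      # d_tx = 0: stepping back to the previous node
        return 1.0 / p
    if G.has_edge(t, x):            # d_tx = 1: x is also a neighbor of t
        return 1.0
    return 1.0 / q                  # d_tx = 2: x moves the walk away from t

def node2vec_walk(G, start, length, p=1.0, q=1.0, rng=random):
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        neighbors = list(G.neighbors(v))
        if not neighbors:
            break
        if len(walk) == 1:          # first step has no previous node t: walk uniformly
            walk.append(rng.choice(neighbors))
            continue
        t = walk[-2]
        weights = [alpha(p, q, t, x, G) for x in neighbors]   # unnormalized pi_vx (w_vx = 1)
        walk.append(rng.choices(neighbors, weights=weights)[0])
    return walk

walk = node2vec_walk(nx.karate_club_graph(), start=0, length=10, p=1.0, q=0.5)
```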
Return parameter, p. Parameter p controls the likelihood of immediately revisiting a node in the walk. Setting it to a high value (> max(q, 1)) ensures that we are less likely to sample an already-visited node in the following two steps (unless the next node in the walk had no other neighbor). This strategy encourages moderate exploration and avoids 2-hop redundancy in sampling.
19. On the other hand, if p is low (< min(q, 1)), it would lead the walk to backtrack a step (Figure 2) and this would keep the walk “local”, close to the starting node u.
In-out parameter, q.
Parameter q allows the search to differentiate
between “inward” and “outward” nodes. Going
back to Figure 2, if q > 1, the random walk is
biased towards nodes close to node t. Such walks
obtain a local view of the underlying graph with
respect to the start node in the walk and
approximate BFS behavior in the sense that our
samples comprise nodes within a small
locality. In contrast, if q < 1, the walk is more
inclined to visit nodes which are further away
from the node t. Such behavior is reflective of
DFS which encourages outward exploration.
However, an essential difference here is that we
achieve DFS-like exploration within the random
walk framework. Hence, the sampled nodes are
not at strictly increasing distances from a given
source node u, but in turn, we benefit from
tractable preprocessing and superior sampling
efficiency of random walks. Note that by setting
$\pi_{vx}$ to be a function of the preceding node t in the walk, the random walks are 2nd-order Markovian.