2. PinSAGE (Graph Convolutional Neural Networks for Web-Scale Recommender Systems)
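This paper was published in 2018, when GCNs were achieving state-of-the-art results on a range of tasks and
graph neural networks were attracting attention.
It points out the following problem, which makes GCNs hard to apply at real-world scale, and its main
contribution is a solution to that problem.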
“ despite the successes of GCN algorithms, no previous works have managed to apply them to production-
scale data with billions of nodes and edges—a limitation that is primarily due to the fact that traditional GCN
methods require operating on the entire graph Laplacian during training”
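[Figure: adjacency matrix of size (number of nodes) x (number of nodes)]
As the number of nodes grows, a GCN that must represent the entire graph as an adjacency matrix
can no longer physically process the graph.
※ Laplacian Matrix = Degree Matrix - Adjacency Matrix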
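To make the scale problem concrete, here is a minimal sketch in plain Python. The toy graph and the 4-bytes-per-entry figure are illustrative assumptions, not values from the paper. It builds L = D - A for a four-node graph, then estimates the size of a dense adjacency matrix for roughly 3 billion nodes (2 billion pins plus 1 billion boards).

```python
def laplacian(adj):
    """Return L = D - A for a square adjacency matrix given as nested lists."""
    n = len(adj)
    degrees = [sum(row) for row in adj]
    return [[(degrees[i] if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def dense_matrix_bytes(num_nodes, bytes_per_entry=4):
    """Bytes needed to store a dense num_nodes x num_nodes matrix."""
    return num_nodes * num_nodes * bytes_per_entry


# Toy undirected graph with edges 0-1, 0-2, 1-2, 2-3 (illustrative only).
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
L = laplacian(A)

# Assumed node count: 2 billion pins + 1 billion boards.
full_graph_bytes = dense_matrix_bytes(3_000_000_000)
print(L)
print(full_graph_bytes / 10**18, "exabytes")
```

Even before any training cost, a dense matrix of that size is far beyond the memory of a single machine, which is why the paper avoids operating on the full Laplacian.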
3. The goal is to train on a graph whose node count (in the billions) is beyond what conventional GCNs can handle,
learn embeddings that represent pins well, and recommend pins using, for example, the similarity between embeddings.
Pinterest is a content discovery application where users interact with pins, which are visual
bookmarks to online content (e.g., recipes they want to cook, or clothes they want to purchase).
Users organize these pins into boards, which contain collections of pins that the user deems to be
thematically related. Altogether, the Pinterest graph contains 2 billion pins, 1 billion boards, and over
18 billion edges (i.e., memberships of pins to their corresponding boards). Our task is to generate
high-quality embeddings or representations of pins that can be used for recommendation
1. Problem Setup
A board is a grouping of pins (1 billion)
Each individual piece of content is a pin (2 billion)
“The goal is to build embeddings that represent pins well,
for use in recommendation”
4. In order to learn these embeddings, we model the Pinterest environment as a bipartite graph
consisting of nodes in two disjoint sets, I (containing pins) and C (containing boards)
1. Problem Setup
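[Figure: bipartite graph. Example labels: food, sculpture, music, sports, ...
Left set: I (containing pins), node attributes: text, image.
Right set: C (containing boards), node attributes: metadata.
Nodes within the same set are not connected; edges exist only between the two sets.]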
5. This idea also appears in GraphSAGE: instead of aggregating over every connected node, sampling limits the
number of neighbors that are used, which alleviates the need to process all connections.
Localized convolutions by sampling the neighborhood around a node and dynamically constructing a
computation graph from this sampled neighborhood. These dynamically constructed computation graphs
(Fig. 1) specify how to perform a localized convolution around a particular node, and alleviate the need to
operate on the entire graph during training
2. Model Architecture (On-the-fly convolutions)
Sampling
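The sampling idea can be sketched as follows. This is a minimal illustration with uniform sampling, in the spirit of GraphSAGE, not the PinSage implementation. The toy graph, the function names, and the `max_neighbors` and `depth` parameters are assumptions made for the example.

```python
import random


def sample_neighbors(graph, node, max_neighbors, rng):
    """Return at most max_neighbors neighbors of node, sampled without replacement."""
    neighbors = graph[node]
    if len(neighbors) <= max_neighbors:
        return list(neighbors)
    return rng.sample(neighbors, max_neighbors)


def computation_graph(graph, node, max_neighbors, depth, rng):
    """Recursively build a bounded neighborhood tree rooted at node."""
    if depth == 0:
        return {"node": node, "children": []}
    children = [
        computation_graph(graph, v, max_neighbors, depth - 1, rng)
        for v in sample_neighbors(graph, node, max_neighbors, rng)
    ]
    return {"node": node, "children": children}


# Toy adjacency lists (illustrative only).
toy_graph = {
    "a": ["b", "c", "d", "e"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
    "e": ["a"],
}
tree = computation_graph(toy_graph, "a", max_neighbors=2, depth=2, rng=random.Random(0))
```

Because every node contributes at most `max_neighbors` children, the cost of one convolution depends on the sample size and depth rather than on the size of the whole graph.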
6. Neighbors are not sampled uniformly at random. Instead, random walks produce a visit count for each node,
and when a fixed number T of neighbors is selected, the T nodes with the highest visit counts
are chosen.
This approach has two main benefits: (1) it favors the most informative neighbors, and (2) the counts can
serve as weights during aggregation, for example in a weighted average.
“Whereas previous GCN approaches simply examine k-hop graph neighborhoods, in PinSage we define
importance-based neighborhoods, where the neighborhood of a node u is defined as the T nodes that
exert the most influence on node u. Concretely, we simulate random walks starting from node u and
compute the L1-normalized visit count of nodes visited by the random walk [14]. The neighborhood of u is
then defined as the top T nodes with the highest normalized visit counts with respect to node u.”
2. Model Architecture (Importance-based neighborhoods)
Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time ( https://arxiv.org/pdf/1711.07601.pdf )
[Figure: [graph] example graph with nodes a, b, c, d. [random walk] three walks from a: a b d, a c d, a b d.
[select top k] visit counts: d = 3, b = 2, c = 1.]
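The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the Pixie or PinSage implementation. The function names, the walk length, and the toy graph are assumptions. The first example reuses the three walks from the slide.

```python
import random
from collections import Counter


def random_walk(graph, start, length, rng):
    """Walk up to `length` steps from start, picking a uniformly random neighbor each step."""
    path = [start]
    for _ in range(length):
        neighbors = graph[path[-1]]
        if not neighbors:
            break
        path.append(rng.choice(neighbors))
    return path


def importance_neighborhood(walks, start, top_t):
    """Return the top_t most visited nodes (excluding start) with L1-normalized visit counts."""
    counts = Counter(node for walk in walks for node in walk if node != start)
    total = sum(counts.values())
    return [(node, count / total) for node, count in counts.most_common(top_t)]


# The three walks shown on the slide: d is visited 3 times, b twice, c once.
slide_walks = [["a", "b", "d"], ["a", "c", "d"], ["a", "b", "d"]]
top2 = importance_neighborhood(slide_walks, "a", top_t=2)

# Simulated walks on a toy directed graph (illustrative only).
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
rng = random.Random(0)
walks = [random_walk(toy_graph, "a", 2, rng) for _ in range(100)]
top1 = importance_neighborhood(walks, "a", top_t=1)
```

The normalized counts returned alongside each node are the values that can later serve as aggregation weights.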
7. Each time we apply the convolve operation (Algorithm 1) we get a new representation for a node, and
we can stack multiple such convolutions on top of each other in order to gain more information about
the local graph structure around node u. In particular, we use multiple layers of convolutions, where
the inputs to the convolutions at layer k depend on the representations output from layer k − 1
(Figure 1) and where the initial (i.e., “layer 0”) representations are equal to the input node features.
Note that the model parameters in Algorithm 1 (Q, q, W, and w) are shared across the nodes but differ
between layers.
2. Model Architecture (Stacking convolutions)
[Figure: two stacked convolution layers (Layer 1, Layer 2), from input to output]
The learnable parameters of Algorithm 1 are shared across all nodes,
but they are not shared across layers.
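A rough sketch of stacking, under stated assumptions: Algorithm 1 itself is not reproduced on this slide, so the `convolve` below is a simplified stand-in (transform neighbors with Q and q, take an importance-weighted mean, combine with the node's own vector through W and w, then L2-normalize). All weights, features, and neighborhoods are made-up toy values.

```python
import math


def relu(vec):
    return [max(0.0, x) for x in vec]


def affine(matrix, bias, vec):
    """Compute matrix @ vec + bias using nested lists."""
    return [sum(m * x for m, x in zip(row, vec)) + b for row, b in zip(matrix, bias)]


def convolve(h_self, neighbor_feats, weights, params):
    """One localized convolution for a single node; params = (Q, q, W, w)."""
    Q, q, W, w = params
    transformed = [relu(affine(Q, q, h)) for h in neighbor_feats]
    total = sum(weights)
    # Importance-weighted mean of the transformed neighbor vectors.
    pooled = [sum(a * t[k] for a, t in zip(weights, transformed)) / total
              for k in range(len(q))]
    out = relu(affine(W, w, h_self + pooled))  # list concatenation of self and pooled
    norm = math.sqrt(sum(x * x for x in out)) or 1.0
    return [x / norm for x in out]


def stack_layers(features, neighborhoods, layer_params):
    """Layer k reads the outputs of layer k-1; layer 0 is the input features."""
    h = dict(features)
    for params in layer_params:  # one (Q, q, W, w) per layer, shared by all nodes
        h = {u: convolve(h[u],
                         [h[v] for v, _ in neighborhoods[u]],
                         [a for _, a in neighborhoods[u]],
                         params)
             for u in h}
    return h


features = {"u": [1.0, 0.0], "v": [0.0, 1.0], "x": [1.0, 1.0]}
neighborhoods = {"u": [("v", 0.75), ("x", 0.25)], "v": [("u", 1.0)], "x": [("u", 0.5), ("v", 0.5)]}
identity2 = [[1.0, 0.0], [0.0, 1.0]]
layer1 = (identity2, [0.0, 0.0], [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0])
layer2 = (identity2, [0.1, 0.1], [[0.5, 0.0, 1.0, 0.0], [0.0, 0.5, 0.0, 1.0]], [0.0, 0.0])
embeddings = stack_layers(features, neighborhoods, [layer1, layer2])
```

The same parameter tuple is applied to every node within a layer, while `layer1` and `layer2` differ, which mirrors the sharing rule stated above.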
8. In order to train the parameters of the model, we use a max-margin-based loss function. The basic idea is
that we want to maximize the inner product of positive examples, i.e., the embedding of the query item and the
corresponding related item. At the same time we want to ensure that the inner product of negative examples—
i.e., the inner product between the embedding of the query item and an unrelated item—is smaller than that of
the positive sample by some pre-defined margin. The loss function for a single pair of node embeddings (zq, zi) :
(q,i) ∈ L is thus
3. Model Training ( max-margin-based loss function )
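$J_{\mathcal{G}}(z_q z_i) = \mathbb{E}_{n_k \sim P_n(q)} \max\{0,\ z_q \cdot z_{n_k} - z_q \cdot z_i + \Delta\}$
Labels: $P_n(q)$ = negative-sample distribution, $z_q \cdot z_{n_k}$ = negative term, $z_q \cdot z_i$ = positive term, $\Delta$ = margin hyperparameter
Pull positive samples closer!
Their dot product grows.
Push negative samples farther away!
Their dot product shrinks.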
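The loss described in the quoted passage can be sketched as a small function: for each negative, take max(0, z_q·z_n - z_q·z_i + margin), then average over the sampled negatives. The averaging stands in for the expectation over the negative-sample distribution. The function names and the numbers are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def max_margin_loss(z_q, z_i, negatives, margin):
    """Mean over negatives of max(0, z_q.z_n - z_q.z_i + margin)."""
    positive_score = dot(z_q, z_i)
    hinge_terms = [max(0.0, dot(z_q, z_n) - positive_score + margin) for z_n in negatives]
    return sum(hinge_terms) / len(hinge_terms)


z_q = [1.0, 0.0]               # query embedding
z_i = [0.9, 0.1]               # positive item, score 0.9
negatives = [
    [0.2, 0.5],                # easy negative: score 0.2, already beyond the margin
    [0.95, 0.0],               # hard negative: score 0.95, violates the margin
]
loss = max_margin_loss(z_q, z_i, negatives, margin=0.1)
```

Only the second negative contributes: the first is already separated from the positive by more than the margin, so its hinge term is zero.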
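9. Each minibatch is split across several GPUs, the gradients are computed and combined, and a single SGD step is
performed. Because a very large number of examples had to be trained on, very large batch sizes between 512 and 4096 were used.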
To make full use of multiple GPUs on a single machine for training, we run the forward and backward
propagation in a multi-tower fashion. With multiple GPUs, we first divide each minibatch (Figure 1 bottom)
into equal-sized portions. Each GPU takes one portion of the minibatch and performs the computations
using the same set of parameters. After backward propagation, the gradients for each parameter across all
GPUs are aggregated together, and a single step of synchronous SGD is performed. Due to the need to
train on extremely large number of examples (on the scale of billions), we run our system with large batch
sizes, ranging from 512 to 4096.
3. Model Training ( Multi-GPU training with large minibatches)
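[Figure: all nodes are divided into minibatches; each minibatch is split across GPU 1 and GPU 2;
the gradients are aggregated, one SGD step is taken, and the shared parameters are updated.]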
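The multi-tower step can be simulated on the CPU with a one-parameter model. This is a sketch under assumptions: the "towers" are plain function calls rather than GPUs, the model is y ≈ w * x with squared error, and the gradients are combined by averaging (the quoted text only says they are "aggregated").

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ≈ w * x over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def split_equal(batch, num_towers):
    """Divide a minibatch into equal-sized portions, one per tower.

    Assumes len(batch) is divisible by num_towers.
    """
    size = len(batch) // num_towers
    return [batch[k * size:(k + 1) * size] for k in range(num_towers)]


def synchronous_step(w, batch, num_towers, lr):
    """Every tower uses the same w; gradients are averaged, then one SGD step is applied."""
    tower_grads = [grad_mse(w, part) for part in split_equal(batch, num_towers)]
    return w - lr * sum(tower_grads) / num_towers


batch = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # points on y = 3x
w_two_towers = synchronous_step(0.0, batch, num_towers=2, lr=0.01)
w_one_tower = synchronous_step(0.0, batch, num_towers=1, lr=0.01)
```

With equal-sized portions, the averaged tower gradients equal the full-minibatch gradient, so splitting changes where the work happens but not the update.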
10. The training procedure has alternating usage of CPUs and GPUs. The model computations are in GPUs,
whereas extracting features, re-indexing, and negative sampling are computed on CPUs. In addition to
parallelizing GPU computation with multi-tower training, and CPU computation using OpenMP [25], we
design a producer consumer pattern to run GPU computation at the current iteration and CPU
computation at the next iteration in parallel. This further reduces the training time by almost a half
3. Model Training ( Producer-consumer minibatch construction)
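[Producer : CPU]
Builds minibatches via random walks. While the GPU works on the current minibatch,
the CPU builds the next minibatch in parallel.
=> This overlap is reported to cut training time almost in half
[Consumer : GPU]
Runs the aggregate and update steps on the minibatches
produced by the CPU.
[Producer] computation for step t+1 (CPU)
[Consumer] computation for step t (GPU)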
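The overlap can be sketched with a thread and a bounded queue from the Python standard library. This is only an illustration of the pattern: the batch contents and the "GPU work" are stand-ins, and `run_pipeline` is a hypothetical helper name.

```python
import queue
import threading


def producer(num_batches, out_queue):
    """CPU side: build minibatches and hand them over; None marks the end."""
    for step in range(num_batches):
        batch = [step * 10 + k for k in range(4)]  # stand-in for sampling and feature lookup
        out_queue.put(batch)                       # blocks while the queue is full
    out_queue.put(None)


def consumer(in_queue, results):
    """GPU side: process each minibatch as soon as it is available."""
    while True:
        batch = in_queue.get()
        if batch is None:
            break
        results.append(sum(batch))                 # stand-in for the forward/backward pass


def run_pipeline(num_batches):
    handoff = queue.Queue(maxsize=1)  # holds at most one prepared batch
    results = []
    worker = threading.Thread(target=producer, args=(num_batches, handoff))
    worker.start()
    consumer(handoff, results)
    worker.join()
    return results


pipeline_results = run_pipeline(3)
```

The bounded queue keeps the producer from running far ahead: it prepares batch t+1 while batch t is being consumed, then waits.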
11. Negative sampling is used in our loss function (Equation 1) as an approximation of the normalization factor
of edge likelihood [23]. To improve efficiency when training with large batch sizes, we sample a set of 500
negative items to be shared by all training examples in each minibatch. This drastically saves the number of
embeddings that need to be computed during each training step, compared to running negative sampling
for each node independently. Empirically, we do not observe a difference between the performance of the
two sampling schemes. In the simplest case, we could just uniformly sample negative examples from the
entire set of items. However, ensuring that the inner product of the positive example (pair of items (q,i)) is
larger than that of the q and each of the 500 negative items is too “easy” and does not provide fine enough
“resolution” for the system to learn. To solve the above problem, for each positive training example (i.e.,
item pair (q,i)), we add “hard” negative examples, i.e., items that are somewhat related to the query item
q, but not as related as the positive item i. We call these “hard negative items”. They are generated by
ranking items in a graph according to their Personalized PageRank scores with respect to query item q [14].
Items ranked at 2000-5000 are randomly sampled as hard negative items.
3. Model Training ( Sampling negative items )
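For performance reasons, 500 negatives are sampled and shared across the minibatch
=> Too easy; the model does not learn fine enough discrimination
“Hard negatives”, items that are somewhat related to the query but less related
than the positive item, are added.
Their share is increased gradually through curriculum training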
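The two sampling schemes can be sketched as below. Assumptions: the ranked list stands in for items sorted by Personalized PageRank score with respect to the query (computing those scores is out of scope here), and the curriculum schedule in `hard_negatives_for_epoch` is a hypothetical example, since the slide does not give the exact schedule.

```python
import random


def sample_shared_negatives(all_items, num_negatives, rng):
    """One set of negatives shared by every training example in the minibatch."""
    return rng.sample(all_items, num_negatives)


def sample_hard_negatives(ranked_items, num_hard, rng, low=2000, high=5000):
    """Pick items whose 1-based rank with respect to the query falls in [low, high]."""
    candidates = ranked_items[low - 1:high]
    return rng.sample(candidates, num_hard)


def hard_negatives_for_epoch(epoch):
    """Hypothetical curriculum: none in epoch 0, then one more hard negative per epoch."""
    return max(0, epoch)


rng = random.Random(0)
# Stand-in ranking: item id equals its rank for the query (illustrative only).
ranked_items = list(range(1, 10001))
shared = sample_shared_negatives(ranked_items, 500, rng)
hard = sample_hard_negatives(ranked_items, hard_negatives_for_epoch(3), rng)
```

Sharing one negative set means the 500 negative embeddings are computed once per minibatch instead of once per training example.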
12. Even after the model is trained, computing embeddings for all nodes remains challenging. As seen during
training, the computations for different nodes inevitably overlap heavily. A MapReduce approach was devised
to address this.
(1) One MapReduce job is used to project all pins to a low dimensional latent space, where the aggregation
operation will be performed (Algorithm 1, Line 1).
(2) Another MapReduce job is then used to join the resulting pin representations with the ids of the boards
they occur in, and the board embedding is computed by pooling the features of its (sampled) neighbors.
3. Node Embeddings via MapReduce
Figure 3 details the data flow on the bipartite pin-to-board Pinterest graph, where we assume the
input (i.e., “layer-0”) nodes are pins/items (and the layer-1 nodes are boards/contexts).
[(1) pins/items] [boards/contexts] [group by context and pool] [(2) embedding for each context]
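The two jobs can be mimicked in plain Python to show why each pin is projected only once. This is a sketch, not a real MapReduce job: `project` is a hypothetical stand-in for the learned projection, mean pooling replaces the pooling over sampled neighbors, and the pins, boards, and memberships are toy data.

```python
from collections import defaultdict


def map_project(pin_features, project):
    """Job 1: project every pin exactly once into the latent space."""
    return {pin: project(feat) for pin, feat in pin_features.items()}


def reduce_pool_by_board(projected, memberships):
    """Job 2: join projected pins with board ids, then mean-pool per board."""
    grouped = defaultdict(list)
    for pin, board in memberships:
        grouped[board].append(projected[pin])
    return {board: [sum(column) / len(vectors) for column in zip(*vectors)]
            for board, vectors in grouped.items()}


projection_calls = []


def project(feat):
    """Hypothetical projection: doubles each feature and records the call."""
    projection_calls.append(feat)
    return [2.0 * x for x in feat]


pin_features = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}
memberships = [("p1", "b1"), ("p2", "b1"), ("p3", "b2"), ("p1", "b2")]
projected = map_project(pin_features, project)
board_embeddings = reduce_pool_by_board(projected, memberships)
```

Pin p1 belongs to two boards but is projected once, which is the repeated work that the MapReduce layout avoids.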
13. 4. Experiment
“Combining both visual/textual and graph
information, PinSage is able to find relevant
items that are both visually and topically
similar to the query item.”
Table 1 compares the performance of the various approaches
using the hit rate as well as the MRR. PinSage with our new
importance-pooling aggregation and hard negative examples
achieves the best performance at 67% hit-rate and 0.59 MRR,
outperforming the top baseline by 40% absolute (150% relative)
in terms of the hit rate and also 22% absolute (60% relative) in
terms of MRR
[mean reciprocal rank]
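For reference, the two metrics can be computed as below using their standard textbook definitions. The exact variant used in the paper may differ, and the ranks and the cutoff k are made-up illustrative values, not results from Table 1.

```python
def hit_rate(ranks, k):
    """Fraction of queries whose correct item is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the correct item over all queries (ranks are 1-based)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Rank of the correct item for four hypothetical queries.
example_ranks = [1, 2, 4, 10]
example_hit_rate = hit_rate(example_ranks, k=3)
example_mrr = mean_reciprocal_rank(example_ranks)
```

Hit rate only checks whether the correct item makes the cutoff, while MRR also rewards placing it nearer the top, which is why the two are reported together.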