4. HGT (Heterogeneous Graph Transformer)
Recent years have witnessed the emerging success of graph neural networks (GNNs) for modeling
structured data. However, most GNNs are designed for homogeneous graphs, in which all nodes and edges
belong to the same types, making them infeasible to represent heterogeneous structures.
Existing studies have mostly been limited to homogeneous graphs, in which nodes and links all share a single, uniform type; in real-world data, however, heterogeneous graphs whose nodes and links have different types and forms are the norm, and this work targets that setting.
[heterogeneous] [homogeneous]
The node and link types differ!
8. Heterogeneous Graph Transformer Approach
[Diverse relation types even between the same nodes]
(1) Instead of attending on node or edge type alone, we use the meta relation ⟨τ(s), φ(e), τ(t)⟩ to decompose the interaction and transform matrices, enabling HGT to capture both the common and specific patterns of different relationships using equal or even fewer parameters.
For example, between an author and a paper there are several distinct relations such as first author, second author, and third author; the meta relation is used so that these different relations between the same pair of nodes can all be represented.
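A rough, back-of-the-envelope sketch (all counts below are made up for illustration, not taken from the paper) of why decomposing by the meta relation uses equal or even fewer parameters than keeping one weight matrix per full ⟨τ(s), φ(e), τ(t)⟩ relation:

```python
# Illustrative parameter count: one d x d matrix per full relation vs. the meta-relation
# decomposition (K/Q projections per node type plus one W_ATT per edge type).
num_node_types = 5    # |A|  (illustrative)
num_edge_types = 10   # |R|  (illustrative)
d = 64                # hidden size, so each weight matrix holds d * d parameters

per_full_relation = num_node_types * num_edge_types * num_node_types * d * d
decomposed = (2 * num_node_types + num_edge_types) * d * d

print(f"one matrix per full relation : {per_full_relation:,} parameters")
print(f"meta-relation decomposition  : {decomposed:,} parameters")
```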
9. Heterogeneous Graph Transformer Approach
[Soft meta paths realized by the architecture itself]
(2) Different from most of the existing works that are based on customized meta paths, we rely on the nature
of the neural architecture to incorporate high-order heterogeneous neighbor information, which automatically
learns the importance of implicit meta paths.
Due to the nature of its architecture, HGT can incorporate information from high-order neighbors of different types through message passing across layers, which can be regarded as "soft" meta paths. That said, even if HGT takes only its one-hop edges as input without manually designing meta paths, the proposed attention mechanism can automatically and implicitly learn and extract the "meta paths" that are important for different downstream tasks.
Earlier approaches define meta paths up front from prior knowledge; here a neural architecture is built that finds the important paths by itself through attention and learns them automatically. => detailed explanation on the following slides.
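A toy sketch of the "soft meta path" idea, with a made-up three-paper/one-author graph and a plain mean aggregation standing in for HGT's attention; it only shows how stacking one-hop layers lets information travel along an implicit multi-hop path such as Paper -> Author -> Paper.

```python
import torch

# one-hop typed edges only: (source, target); p* are papers, a* is an author (illustrative)
edges = [("p1", "a1"), ("p2", "a1"), ("a1", "p3")]
h = {n: torch.randn(4) for n in ["p1", "p2", "p3", "a1"]}

def layer(h):
    """One round of message passing: each target averages its one-hop sources."""
    out = {n: v.clone() for n, v in h.items()}
    for tgt in h:
        msgs = [h[src] for src, t in edges if t == tgt]
        if msgs:
            out[tgt] = torch.stack(msgs).mean(0)
    return out

h1 = layer(h)    # after layer 1, a1 carries information from p1 and p2
h2 = layer(h1)   # after layer 2, p3 carries p1/p2 information via a1,
                 # i.e. along the implicit meta path Paper -> Author -> Paper
print(h2["p3"])
```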
10. Heterogeneous Graph Transformer Approach
[Time gap applied via positional encoding]
(3) Most previous works don't take the dynamic nature of (heterogeneous) graphs into consideration, while we propose the relative temporal encoding technique to incorporate temporal information by using limited computational resources.
(4) None of the existing heterogeneous GNNs are designed for and experimented with Web-scale graphs; we therefore propose the heterogeneous Mini-Batch graph sampling algorithm designed for Web-scale graph training, enabling experiments on the billion-scale Open Academic Graph.
[A mini-batch sampling technique that keeps the heterogeneous graph balanced]
16. Overall Architecture - Heterogeneous Mutual Attention Details (2)
We add a prior tensor µ ∈ R^{|A|×|R|×|A|} to denote the general significance of each meta relation triplet.
Therefore, unlike the vanilla Transformer that directly calculates the dot product between the Query and Key vectors, we keep a distinct edge-based matrix W^ATT_φ(e) ∈ R^{(d/h)×(d/h)} for each edge type φ(e). In doing so, the model can capture different semantic relations even between the same node type pairs.
Linear projection (or a fully connected layer) is perhaps one of the most common operations in deep learning models. When doing linear projection, we can project a vector x of dimension n to a vector y of dimension m by multiplying by a projection matrix W of shape [n, m].
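A small, self-contained illustration of that projection; the input sizes 32 and 64 and the common output size 16 are arbitrary choices for the example.

```python
import torch

m = 16                         # common output dimension
x_author = torch.randn(32)     # a feature vector of size n = 32
x_paper = torch.randn(64)      # a feature vector of size n = 64

W_author = torch.randn(32, m)  # projection matrix of shape [n, m]
W_paper = torch.randn(64, m)

y_author = x_author @ W_author # y = x W: both outputs now live in the same 16-dim space
y_paper = x_paper @ W_paper
print(y_author.shape, y_paper.shape)   # torch.Size([16]) torch.Size([16])
```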
Heterogeneous Mutual Attention
17. Overall Architecture - Heterogeneous Mutual Attention Details (3)
The prior tensor µ reflects the importance of each triplet (T - E - S).
Fully connected layers project vectors of different sizes to one common size.
[Figure: attention in the original Transformer (dot product of Q and K) versus the modified attention, which scores the full target - edge - source (T - E - S) triplet instead of just the similarity between T and S.]
Heterogeneous Mutual Attention
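A minimal single-head sketch of the modified attention (node/edge types, features, and dimensions are made up; this is not the authors' released code): the score passes through the edge-type matrix W^ATT and is scaled by the meta-relation prior µ before the softmax over a target's neighbors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 8
node_types, edge_types = ["paper", "author"], ["writes", "cites"]
k_lin = nn.ModuleDict({t: nn.Linear(d, d) for t in node_types})  # K projection per source type
q_lin = nn.ModuleDict({t: nn.Linear(d, d) for t in node_types})  # Q projection per target type
w_att = nn.ParameterDict({e: nn.Parameter(torch.eye(d)) for e in edge_types})  # per edge type
# mu holds one prior scalar per meta relation triplet (|A| x |R| x |A| in the paper)
mu = nn.Parameter(torch.ones(len(node_types), len(edge_types), len(node_types)))

def attention(h_t, tau_t, sources):
    """sources: list of (h_s, tau_s, phi_e) for the target's neighbors."""
    q = q_lin[tau_t](h_t)
    scores = []
    for h_s, tau_s, phi_e in sources:
        k = k_lin[tau_s](h_s)
        prior = mu[node_types.index(tau_s), edge_types.index(phi_e), node_types.index(tau_t)]
        # K(s) W_ATT Q(t) scaled by the prior, instead of a plain Q . K dot product
        scores.append((k @ w_att[phi_e] @ q) * prior / d ** 0.5)
    return F.softmax(torch.stack(scores), dim=0)   # one attention weight per neighbor

h_target = torch.randn(d)
neighbors = [(torch.randn(d), "author", "writes"), (torch.randn(d), "paper", "cites")]
print(attention(h_target, "paper", neighbors))
```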
20. Overall Architecture - Relative Temporal Encoding
The traditional way to incorporate temporal information is to construct a separate graph for each time slot.
However, such a procedure may lose a large portion of structural dependencies across different time slots. We
propose the Relative Temporal Encoding (RTE) mechanism to model the dynamic dependencies in
heterogeneous graphs. RTE is inspired by Transformer's positional encoding method. Specifically, given a source node s and a target node t, along with their corresponding timestamps T(s) and T(t), we denote the relative time gap ∆T(t,s) = T(t) − T(s) as an index to get a relative temporal encoding RTE(∆T(t,s)).
The time gap between the target node and the source node is embedded with the positional-encoding technique, and this encoding is added (+) to the source representation to express the relative time difference.
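A minimal sketch of RTE under stated assumptions (hidden size, maximum gap, and the timestamps are illustrative; this is not the released pyHGT implementation): a fixed sinusoidal table is indexed by the time gap, passed through a trainable linear layer, and added to the source node's representation.

```python
import math
import torch
import torch.nn as nn

d, max_gap = 16, 240   # hidden size and largest relative gap covered by the table (illustrative)

# Sinusoidal basis table, one row per possible time gap (as in Transformer positional encoding).
pos = torch.arange(max_gap).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d, 2).float() * (-math.log(10000.0) / d))
base = torch.zeros(max_gap, d)
base[:, 0::2] = torch.sin(pos * div)
base[:, 1::2] = torch.cos(pos * div)

rte_lin = nn.Linear(d, d)     # trainable projection on top of the fixed basis

def rte(delta_t: int) -> torch.Tensor:
    """RTE(dT): look up the basis row for the gap and project it."""
    return rte_lin(base[delta_t])

h_s = torch.randn(d)          # source node representation
T_t, T_s = 2018, 2015         # target / source timestamps (illustrative)
h_s_temporal = h_s + rte(T_t - T_s)   # add the encoding to the source before attention
print(h_s_temporal.shape)
```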
22. Reference - Layer Dependent Importance Sampling
Layer-Dependent Importance Sampling for Training Deep and Large Graph Convolutional Networks
https://arxiv.org/pdf/1911.07323.pdf
In practice the normalized Laplacian is used (the plain, unnormalized Laplacian is D - A).
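A small sketch of the layer-dependent sampling probability on a toy four-node graph, following the LADIES idea referenced above rather than its actual code: the normalized Laplacian is restricted to the rows of the nodes already sampled for the upper layer, and each candidate's probability is proportional to its squared column norm.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)         # toy adjacency matrix (illustrative)
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_norm = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt  # normalized Laplacian I - D^-1/2 A D^-1/2

upper_layer_nodes = [0, 1]                        # nodes already sampled for the layer above
rows = L_norm[upper_layer_nodes, :]               # keep only those rows
p = (rows ** 2).sum(axis=0)
p /= p.sum()                                      # layer-dependent sampling probability per node
print(p)
```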
23. Overall Architecture - HGSampling
[Problems with directly applying existing sampling methods]
Directly using them for heterogeneous graphs is prone to yield sub-graphs that are extremely imbalanced regarding different node types, because the degree distribution and the total number of nodes for each type can vary dramatically.
=> The sampled sub-graph can become severely imbalanced in terms of node types and their data distribution.
[Solution]
1) keep a similar number of nodes and edges for each type
=> Maintain a budget that is kept separately per type, so that nodes and edges of every type are sampled evenly.
2) keep the sampled sub-graph dense to minimize the information loss and reduce the sample variance
=> To decide which of the many neighbor nodes to pick, compute a sampling probability for each neighbor and sample accordingly
(see Layer-Dependent Importance Sampling; Algorithm 2 in the paper; a condensed sketch follows below).
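A condensed sketch of the budget idea on a toy graph (hypothetical node names; not Algorithm 2 verbatim): one budget per node type so each type ends up with a similar number of samples, with each neighbor's probability proportional to its squared accumulated budget value.

```python
import random
from collections import defaultdict

# toy heterogeneous graph: node -> list of (neighbor, neighbor_type)  (illustrative)
graph = {
    "p1": [("a1", "author"), ("a2", "author"), ("v1", "venue")],
    "p2": [("a2", "author"), ("v1", "venue")],
}
budget = defaultdict(dict)            # budget[node_type][node] = accumulated score

def add_in_budget(node):
    """Add node's neighbors into the per-type budget, weighted by its normalized degree."""
    neighbors = graph.get(node, [])
    for nbr, nbr_type in neighbors:
        budget[nbr_type][nbr] = budget[nbr_type].get(nbr, 0.0) + 1.0 / len(neighbors)

def sample_per_type(n):
    """Sample up to n nodes of every type, probability proportional to squared budget value.
    (random.choices samples with replacement for brevity; the paper samples without
    replacement and removes chosen nodes from the budget.)"""
    sampled = {}
    for node_type, cand in budget.items():
        nodes = list(cand)
        weights = [cand[v] ** 2 for v in nodes]
        sampled[node_type] = random.choices(nodes, weights=weights, k=min(n, len(nodes)))
    return sampled

for seed in ["p1", "p2"]:             # the output (seed) nodes initialize the budget
    add_in_budget(seed)
print(sample_per_type(n=2))
```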
24. Overall Architecture - Inductive Timestamp Assignment
Till now we have assumed that each node t is assigned with a timestamp T(t). However, in real-world heterogeneous graphs, many nodes are not associated with a fixed time. Therefore, we need to assign different timestamps to them. We denote these nodes as plain nodes.
Details of the Add-In-Budget method in Algorithm 1
(when a node is put into the budget, if it has no timestamp it inherits the target's time).
Whether or not the node carries its own time, the target's normalized degree D_t is added to its budget value, as sketched below.
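A minimal sketch of the inductive timestamp assignment inside Add-In-Budget (node names and times are hypothetical): a plain node without a timestamp of its own inherits the target's timestamp, and the target's normalized degree D_t is added to its budget value either way.

```python
from collections import defaultdict

timestamps = {"p1": 2018, "p2": 2016}          # event nodes with fixed times (illustrative)
graph = {"p1": ["a1", "v1"], "p2": ["a1"]}     # target -> neighbor list (illustrative)
budget = defaultdict(lambda: {"score": 0.0, "time": None})

def add_in_budget(target):
    neighbors = graph[target]
    d_t = 1.0 / len(neighbors)                 # target's normalized degree D_t
    for s in neighbors:
        if budget[s]["time"] is None:          # assign a timestamp only once
            # plain node: inherit the target's time; otherwise keep the node's own time
            budget[s]["time"] = timestamps.get(s, timestamps[target])
        budget[s]["score"] += d_t              # D_t is added whether or not s had a time

add_in_budget("p1")
add_in_budget("p2")
print(dict(budget))
```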