H-Transformer-1D:
Fast One Dimensional Hierarchical Attention For Sequences
2021.08.15
Natural Language Processing Team
진명훈 백지윤 신동진
00. Paper and Authors
Zhenhai Zhu
NLU / CV / Numerical Analysis
/ Computer-aided Design
https://ai.facebook.com/people/sinong-wang/
Belinda Z. Li
NLP / NMT / Language Generation
/ Syntactic Parsing / Discourse Parsing
https://belindal.github.io/
2021.07.25
01.
What's the problem?
Transformers a.k.a. Self-Attention
Transformers built on self-attention perform remarkably well across virtually every domain!
Self-Attention Bottleneck
Self-attention, the Transformer's core idea, has a computational complexity of 𝒪(n²)!
Related Works
How have other papers approached this problem?
01. What's the problem?
Transformers a.k.a. Self-Attention
• Attention is the core building block of RNNs (Luong et al., 2015), CNNs (Bello et al., 2019), and GCNs (Velickovic et al., 2018)
• Linearly combining information using content-based weights
• Among these, Multi-Head Scaled Dot-Product Attention is the core structure of the Transformer (Vaswani et al., 2017), which holds SOTA across a wide range of understanding and generation tasks
• Truly all-purpose... it achieves SOTA on every one of the tasks below
• Machine Translation, Document Classification, Entailment, Summarization, Question Answering (BigBird, Transformer-XL, Adaptive input repr for neural LM)
• Music Generation (Music Transformer)
• Image Generation (Generative pretraining from pixels, Image Transformer)
• Genomics (BigBird, MLM for proteins via linearly scalable long-context transformers)
• The Transformer is also the backbone of BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020)
[Figure: Transformer encoder-decoder architecture (Vaswani et al., 2017) — input/output embeddings, positional encoding, multi-head attention, feed-forward, add & norm, linear + softmax]
01. What's the problem?
Transformers a.k.a. Self-Attention
[Figure: worked example of self-attention on a 5-token sequence (</s> s1 s2 s3 s4 → embedding + projection) — QW_i^Q ∈ ℝ^{5×3}, KW_i^K ∈ ℝ^{3×5}, Att ∈ ℝ^{5×5}, VW_i^V ∈ ℝ^{5×3}, O ∈ ℝ^{5×3} — shown alongside the Transformer encoder-decoder diagram]
Self-Attention: softmax( (QW_i^Q)(KW_i^K)ᵀ / √d ) · VW_i^V
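To make the picture concrete, here is a minimal PyTorch sketch of a single attention head matching the toy shapes above (5 tokens, head dimension 3); tensor and variable names are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model, d_head = 5, 12, 3   # toy sizes from the figure
x = torch.randn(seq_len, d_model)     # embedded + projected inputs

# per-head projection matrices W_i^Q, W_i^K, W_i^V
w_q = torch.randn(d_model, d_head)
w_k = torch.randn(d_model, d_head)
w_v = torch.randn(d_model, d_head)

q = x @ w_q                           # (5, 3) = QW_i^Q
k = x @ w_k                           # (5, 3) = KW_i^K
v = x @ w_v                           # (5, 3) = VW_i^V

att = F.softmax(q @ k.T / d_head**0.5, dim=-1)   # (5, 5) attention matrix
out = att @ v                                    # (5, 3) head output O
print(att.shape, out.shape)           # torch.Size([5, 5]) torch.Size([5, 3])
```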
01. What's the problem?
Self-Attention Bottleneck
[Figure: the same worked attention example and Transformer encoder-decoder diagram as above]
The two matrix multiplications are the bottleneck!!
softmax( (QW_i^Q)(KW_i^K)ᵀ / √d ) · VW_i^V  →  𝒪(L²d)
01. What's the problem?
Self-Attention Bottleneck
• However, it requires a number of operations quadratic in the sequence length (∼ 𝒪(L²d))
• This becomes a very serious bottleneck when processing very long inputs (especially 1,000+ tokens)
• Efficient Transformers: A Survey (Tay et al., 2020d)
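To make the quadratic cost concrete, a rough back-of-the-envelope sketch (ours, not from the slides) of the memory taken by the attention maps alone at a long sequence length:

```python
# rough memory estimate for the L x L attention matrices of one layer (fp32)
L, heads = 8192, 8
bytes_per_float = 4
attn_bytes = L * L * heads * bytes_per_float
print(f"{attn_bytes / 2**30:.1f} GiB")   # ~2.0 GiB for a single layer's attention maps
```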
01. What's the problem?
Related Works
https://arxiv.org/abs/2009.06732
There has been a great deal of work on subquadratic self-attention!
01. What's the problem?
Related Works
http://gabrielilharco.com/publications/EMNLP_2020_Tutorial__High_Performance_NLP.pdf
[Figure: timeline of subquadratic attention methods — Sparse Transformer (Child et al., 2019), Routing Transformer (Roy et al., 2020), Linformer (Wang et al., 2020), Big Bird (Zaheer et al., 2020), Reformer (Kitaev et al., 2020), Performer (Choromanski et al., 2020), Linear Transformer (Katharopoulos et al., 2020) — with per-layer attention complexities ranging from 𝒪(ℓ√ℓ) and 𝒪(ℓ log ℓ) down to 𝒪(ℓ)]
01. What's the problem?
Related Works
[Figure: full Keys × Queries attention matrix]
Goal: approximate the attention computation to make it efficient!
01. What's the problem?
Related Works
[Figure: Keys × Queries attention matrix with a fixed, data-independent sparsity pattern]
• Data-Independent Patterns
• Blockwise Transformer
• Sparse Transformer
• Longformer
• Big Bird
http://gabrielilharco.com/publications/EMNLP_2020_Tutorial__High_Performance_NLP.pdf
01. What's the problem?
Related Works
[Figure: Keys × Queries attention matrix with a data-dependent sparsity pattern]
• Data-Independent Patterns
• Data-Dependent Patterns
• Linformer
• Reformer
• Routing Transformer
• Clustered Attention
• Sinkhorn Transformer
http://gabrielilharco.com/publications/EMNLP_2020_Tutorial__High_Performance_NLP.pdf
01. What's the problem?
Related Works
• Data-Independent Patterns
• Data-Dependent Patterns
• Kernels and Alternative Attention Mechanisms
• Linear Transformer
• Random Feature Attention
• Performer
• Synthesizer
• Recurrence
• Transformer-XL
• Compressive Transformers
http://gabrielilharco.com/publications/EMNLP_2020_Tutorial__High_Performance_NLP.pdf
01. What's the problem?
Related Works
• Data-Independent Patterns
http://gabrielilharco.com/publications/EMNLP_2020_Tutorial__High_Performance_NLP.pdf
[Figure: taxonomy of fixed sparsity patterns]
  Blockwise: Blockwise Transformer, Local Attention
  Strided: Sparse Transformer, Longformer, Big Bird
  Diagonal: Longformer, Big Bird
  Random: Big Bird
  Global: Longformer, ETC, Big Bird
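As a concrete illustration (ours, not from the slides), here is a minimal sketch of what a data-independent pattern looks like as a boolean attention mask — a block-diagonal local window combined with a few global tokens, roughly in the spirit of Longformer/Big Bird. The function name and parameters are hypothetical.

```python
import torch

def fixed_pattern_mask(seq_len: int, block_size: int, num_global: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed): block-diagonal local blocks + global tokens."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    # blockwise / local pattern: each token attends within its own block
    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        mask[start:end, start:end] = True
    # global pattern: the first `num_global` tokens attend everywhere and are attended by everyone
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

mask = fixed_pattern_mask(seq_len=16, block_size=4, num_global=2)
print(mask.int())  # 16x16 pattern of 0/1
```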
01. What's the problem?
Related Works
• Data-Dependent Patterns
http://gabrielilharco.com/publications/EMNLP_2020_Tutorial__High_Performance_NLP.pdf
[Figure: data-dependent bucketing strategies — buckets via hashing (sorting and blocking), buckets via clustering, and compression]
02.
How do we solve it?
H-Matrix and Multigrid Method + Intuition
We introduce numerical methods for exploiting the sparsity of the attention matrix
Hierarchical Attention and its computational complexity
We introduce the H-Transformer-1D algorithm that applies these ideas, and its complexity
How to implement?
We walk through how it is implemented in torch, following Lucidrains' implementation
02. How do we solve it?
H-Matrix and Multigrid method + Intuition
Q)
Why must some fixed pattern be zeroed out at all?
There are plenty of more mathematical (numerical analysis) techniques for making a matrix sparse!
Wouldn't using a Hierarchical Matrix (H-Matrix), among them, work especially well?
02. How do we solve it?
H-Matrix and Multigrid method + Intuition
Z = softmax( QKᵀ / √d ) V
Z = D⁻¹AV,  A = e^S
(where S = QKᵀ/√d and D is the diagonal matrix of the row sums of A)
Let's write self-attention as the equations above!
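A quick sanity-check sketch (ours, not from the slides) showing that the two forms agree, with D taken as the diagonal matrix of row sums of A = e^S:

```python
import torch

torch.manual_seed(0)
L, d = 16, 8
Q, K, V = torch.randn(L, d), torch.randn(L, d), torch.randn(L, d)

S = Q @ K.T / d**0.5
Z_softmax = torch.softmax(S, dim=-1) @ V     # standard softmax form

A = torch.exp(S)                             # unnormalized attention
D_inv = torch.diag(1.0 / A.sum(dim=-1))      # inverse of the row-sum diagonal
Z_factored = D_inv @ A @ V                   # Z = D^{-1} A V

print(torch.allclose(Z_softmax, Z_factored, atol=1e-5))  # True
```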
02. How do we solve it?
H-Matrix and Multigrid method + Intuition
[Figure: A ∈ ℝ^{16×16} ≈ (Q ∈ ℝ^{16×d}) · (Kᵀ ∈ ℝ^{d×16}) — "self-attention is a low-rank matrix"]
02. How do we solve it?
H-Matrix and Multigrid method + Intuition
A_{i,j} = e^{S_{i,j}},  S_{i,j} = 2e^{−(i−j)²} − 1
[Figure: 16×16 banded matrix A whose entries decay away from the diagonal — 2.7183, 0.7678, 0.3816, 0.3680, 0.3679, ...]
• Let's use the matrix on the left as an example!
• The sequence length is 16
• Even with a loose tolerance of 10⁻¹, the matrix A is full-rank
• In other words, a standard global low-rank approximation is not effective
02. How do we solve it?
H-Matrix and Multigrid method + Intuition
A_{i,j} = e^{S_{i,j}},  S_{i,j} = 2e^{−(i−j)²} − 1
[Figure: the same 16×16 matrix split into a two-level block hierarchy — the 4×4 diagonal blocks have rank 4 (full), the off-diagonal blocks have rank 2]
• Let's split this into two levels of matrix blocks (hierarchically)!
• Then, at a tolerance of 10⁻³, each block matrix has the rank shown in the figure!
• The off-diagonal terms also carry values, so discarding them outright hurts performance (poor approximation)
• Our method can also increase the compression rate by adding more levels!
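A small numerical sketch (our own construction of the toy matrix above, not from the slides) that reproduces both observations — A is full-rank globally, yet an off-diagonal block is nearly rank 2:

```python
import numpy as np

n = 16
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
S = 2 * np.exp(-(i - j) ** 2) - 1
A = np.exp(S)

# global numerical rank at a loose tolerance
print(np.linalg.matrix_rank(A, tol=1e-1))    # 16 -> full rank

# an 8x8 off-diagonal block: its singular values decay fast -> low numerical rank
block = A[:8, 8:]
s = np.linalg.svd(block, compute_uv=False)
print(np.round(s, 4))
print(int((s > 1e-3 * s[0]).sum()))          # numerical rank at relative tolerance 1e-3 (≈ 2)
```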
02. How do we solve it?
H-Matrix and Multigrid method + Intuition
[Figure: A ∈ ℝ^{16×16} ≈ (Q ∈ ℝ^{16×d}) · (Kᵀ ∈ ℝ^{d×16}) with the hierarchical block structure overlaid — rank-4 diagonal blocks, rank-2 off-diagonal blocks]
02. How do we solve it?
H-Matrix and Multigrid method + Intuition
The Multigrid method!
A multi-level nested iterative method for solving the large sparse linear systems that result from discretized partial differential equations
https://www.cambridge.org/core/books/introduction-to-numerical-geodynamic-modelling/multigrid-method/D8858D6C897D3AC0F44E6C296E86585F
02. How do we solve it?
H-Matrix and Multigrid method + Intuition
https://www.researchgate.net/figure/Illustration-of-the-multigrid-V-cycle_fig2_328599327
[Figure: multigrid V-cycle — coarsening by simple averaging on the way down, interpolation by simple duplication on the way up]
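A minimal sketch (ours, in PyTorch) of the two ingredients the slide names — coarsening by simple averaging of neighbouring positions, and interpolation back by simple duplication:

```python
import torch

x = torch.arange(8, dtype=torch.float32).reshape(1, 8, 1)   # (batch, seq_len, dim)

# coarsening: average each pair of neighbouring positions -> sequence length halves
coarse = x.reshape(1, 4, 2, 1).mean(dim=2)                   # (1, 4, 1)

# interpolation: duplicate each coarse position -> sequence length doubles again
fine = coarse.repeat_interleave(2, dim=1)                    # (1, 8, 1)

print(x.squeeze(), coarse.squeeze(), fine.squeeze(), sep="\n")
```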
02. How do we solve it?
H-Matrix and Multigrid method + Intuition
Hierarchical Structure → Inductive Bias
[Figure: the hierarchical block-rank pattern of A again — rank-4 diagonal blocks, rank-2 off-diagonal blocks]
• The paper explains it like this:
  • "Sharp nearby, fuzzy far away!"
  • As sentences get longer, word order has almost no effect on performance
  • The model keeps only a high-level, coarse representation of far-away words
• The key idea of the multilevel method:
  • Perform no approximation for near interactions, and apply progressively lower-precision approximation for progressively longer-distance interactions
  • Full-rank diagonal blocks (near interactions)
  • Higher-precision approximation for 4×4 off-diagonal blocks (mid-distance)
  • Lower-precision approximation for 8×8 off-diagonal blocks (long-distance)
• Inductive bias: the hypothesis we put forward :)
  • The attention matrix should have a hierarchical low-rank structure!
  • Good benchmark performance backs this up!
  • Though for that claim, the experimental setup is... hmm...
02. How do we solve it?
Hierarchical Attention and its computational complexity
[Figure: the same timeline of subquadratic attention methods as before — Sparse Transformer (Child et al., 2019), Routing Transformer (Roy et al., 2020), Linformer (Wang et al., 2020), Big Bird (Zaheer et al., 2020), Reformer (Kitaev et al., 2020), Performer (Choromanski et al., 2020), Linear Transformer (Katharopoulos et al., 2020) — now extended with Luna: Linear Unified Nested Attention (Xuezhe Ma et al., 2021) and H-Transformer-1D (Zhenhai Zhu et al., 2021), both 𝒪(ℓ)]
02. How do we solve it?
How to implement?
Q)
How can this actually be implemented?
Do the experimental results really support the claimed inductive bias and speed?
02. How do we solve it?
How to implement?
https://github.com/lucidrains/h-transformer-1d/issues/1
02. How do we solve it?
How to implement?
https://github.com/lucidrains/h-transformer-1d/commit/5660abdab7c7359c9c178032c73f37a3241937d1
⊕ RotaryEmbedding
⊕ Reversible Residual Connection
02. How do we solve it?
How to implement?
"안녕하세요 오늘 논문 읽기 모임에 참여한 여러분들 모두 환영합니다 Google의 논문인데 많이 아쉽네요"
(example input: "Hello, welcome everyone joining today's paper-reading meetup. It's a Google paper, but it's quite disappointing")
['▁안녕', '하세요', '▁오늘', '▁논문', '▁읽', '기', '▁모임', '에', '▁참여한', '▁여러분', '들', '▁모두', '▁환영', '합', '니다', '▁G', 'oo', 'g', 'le', '의', '▁논문', '인데', '▁많이', '▁아쉽', '네요']
[22465, 23935, 14864, 24313, 15350, 9264, 19510, 11786, 19205, 18918, 9993, 14422, 20603, 13600, 20211, 15464, 24327, 302, 16203, 12024, 24313, 15094, 14605, 26180, 29221]
(bsz,) == (1,)
(bsz, seq_len) == (1, 25)
(bsz, seq_len) == (1, 25)
02. How do we solve it?
How to implement?
(same example sentence, tokens, and id/mask shapes as above)
After embedding: (bsz, seq_len, emb_dim) == (1, 25, 512)
02. How do we solve it?
How to implement?
(same example sentence, tokens, and id/mask shapes as above)
After padding the sequence (25 → 32): (bsz, padded_seq_len, emb_dim) == (1, 32, 512)
02. How do we solve it?
How to implement?
(bsz, padded_seq_len, emb_dim) == (1, 32, 512)
After splitting into heads: (bsz*num_heads, padded_seq_len, head_dim) == (8, 32, 64)
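A minimal shape walk-through in PyTorch (our own sketch; tensor names are illustrative) reproducing the reshapes above — pad the 25-token batch to length 32, then split the 512-dim embedding into 8 heads of 64 and fold the heads into the batch dimension:

```python
import torch
import torch.nn.functional as F

bsz, seq_len, emb_dim, num_heads = 1, 25, 512, 8
head_dim = emb_dim // num_heads                     # 64

x = torch.randn(bsz, seq_len, emb_dim)              # embedded tokens, (1, 25, 512)

padded_len = 32                                     # next power of two >= 25
x = F.pad(x, (0, 0, 0, padded_len - seq_len))       # pad the sequence dimension -> (1, 32, 512)

# split the embedding dimension into heads and fold heads into the batch dimension
x = x.reshape(bsz, padded_len, num_heads, head_dim)                       # (1, 32, 8, 64)
x = x.permute(0, 2, 1, 3).reshape(bsz * num_heads, padded_len, head_dim)  # (8, 32, 64)
print(x.shape)
```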
02. How do we solve it?
How to implement?
Level 0
(8, 32, 64)
Level 1
(8, 16, 64)
Level 2
(8, 8, 64)
Level 3
(8, 4, 64)
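Each level is obtained by average-pooling the previous one along the sequence axis; a minimal sketch (ours, following only the shapes on the slide) producing the level tensors listed above:

```python
import torch
import torch.nn.functional as F

q = torch.randn(8, 32, 64)          # level 0: (bsz*num_heads, padded_seq_len, head_dim)

levels = [q]
for _ in range(3):                  # levels 1..3
    prev = levels[-1]
    # average neighbouring positions: (B, L, D) -> (B, L/2, D)
    coarse = F.avg_pool1d(prev.transpose(1, 2), kernel_size=2).transpose(1, 2)
    levels.append(coarse)

for lvl, t in enumerate(levels):
    print(lvl, tuple(t.shape))      # (8, 32, 64), (8, 16, 64), (8, 8, 64), (8, 4, 64)
```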
02. How do we solve it?
How to implement?
[Figure: at each level, the coarsened QW_i^Q and KW_i^K produce a block attention matrix Ã_l, which is applied to VW_i^V to give Ỹ_l; "Flip!" marks the key/value blocks being swapped with their neighbours so each query block attends to its sibling block at this level]
Level 3
Q, K, V shape: (bsz*num_heads, (seq_len/2**level)/block_size, block_size, head_dim) == (8, 2, 2, 64)
  8 = 1 * 8,  2 = (32 / 2**3) / 2,  2 = block_size,  64 = 512 / 8
Y shape: (bsz*num_heads, seq_len/2**level, head_dim) == (8, 4, 64)
  8 = 1 * 8,  4 = 32 / 2**3,  64 = 512 / 8
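A sketch of the level-3 step (ours, loosely following the idea in Lucidrains' implementation rather than copying it): reshape the coarsened Q, K, V into blocks, swap each pair of neighbouring K/V blocks ("Flip!"), and attend block-to-block to get Ỹ_l. For simplicity this sketch normalizes within each block with softmax, whereas the actual method accumulates unnormalized scores across levels and normalizes once at the end.

```python
import torch

B, L3, D, block = 8, 4, 64, 2            # level-3 tensors: (8, 4, 64), block size 2
q3, k3, v3 = (torch.randn(B, L3, D) for _ in range(3))

# reshape into blocks: (B, num_blocks, block, D) == (8, 2, 2, 64)
qb = q3.reshape(B, L3 // block, block, D)
kb = k3.reshape(B, L3 // block, block, D)
vb = v3.reshape(B, L3 // block, block, D)

# "Flip!": swap each pair of neighbouring key/value blocks so a query block
# attends to its sibling block (the off-diagonal interaction at this level)
kb = kb.reshape(B, -1, 2, block, D).flip(dims=(2,)).reshape(B, -1, block, D)
vb = vb.reshape(B, -1, 2, block, D).flip(dims=(2,)).reshape(B, -1, block, D)

# block-local attention: A_l is (B, num_blocks, block, block), Y_l is (B, num_blocks, block, D)
attn = torch.softmax(qb @ kb.transpose(-1, -2) / D**0.5, dim=-1)
y3 = (attn @ vb).reshape(B, L3, D)        # back to (8, 4, 64)
print(attn.shape, y3.shape)
```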
02. How do we solve it?
How to implement?
[Figure: same per-level diagram as above — Ã_l and Ỹ_l, with the neighbouring key/value blocks flipped]
Level 2
Q, K, V shape: (bsz*num_heads, (seq_len/2**level)/block_size, block_size, head_dim) == (8, 4, 2, 64)
  8 = 1 * 8,  4 = (32 / 2**2) / 2,  2 = block_size,  64 = 512 / 8
Y shape: (bsz*num_heads, seq_len/2**level, head_dim) == (8, 8, 64)
  8 = 1 * 8,  8 = 32 / 2**2,  64 = 512 / 8
02. How do we solve it?
How to implement?
[Figure: same per-level diagram as above — Ã_l and Ỹ_l, with the neighbouring key/value blocks flipped]
Level 1
Q, K, V shape: (bsz*num_heads, (seq_len/2**level)/block_size, block_size, head_dim) == (8, 8, 2, 64)
  8 = 1 * 8,  8 = (32 / 2**1) / 2,  2 = block_size,  64 = 512 / 8
Y shape: (bsz*num_heads, seq_len/2**level, head_dim) == (8, 16, 64)
  8 = 1 * 8,  16 = 32 / 2**1,  64 = 512 / 8
02. How do we solve it?
How to implement?
[Figures: building up the approximated attention matrix level by level — Level 3; Level 3 + Level 2; Level 3 + Level 2 + Level 1; Level 3 + Level 2 + Level 1 + Level 0 (block size = 2)]
02. How do we solve it?
How to implement?
[Figure: the same assembly with block size = 4 — Level 2 + Level 1 + Level 0]
[Figure: block size = 8 — Level 1 + Level 0]
We plan to open an ISSUE about the places where the implementation differs from the paper!
03.
Did it actually work? What are the limitations?
Experimental Settings and Datasets
We introduce the paper's experimental setup and the LRA and 1B Words datasets
Reported Experiments and Results
We report the experimental results presented in the paper.
Limitations and Future Work
We discuss the paper's all-too-clear limitations and talk about future work.
03. Did it actually work? What are the limitations?
Experimental Settings and Datasets
Benchmark 1: Long-Range Arena (LRA)
- Evaluates transformer-based models in a systematic way
- Measures generalization power, computational efficiency, and memory footprint
Tasks
- ListOps: long mathematical expressions
- Text: text classification
- Retrieval: document retrieval
- Image: CIFAR-10 image → flattened sequence ⇒ classification
- Pathfinder: image of two circles and dashed lines → flattened ⇒ classify their spatial layout
- Path-X: extension of Pathfinder (32×32 → 128×128)
03. Did it actually work? What are the limitations?
Experimental Settings and Datasets
https://github.com/redna11/lra-igloo
https://github.com/google-research/long-range-arena
https://arxiv.org/pdf/2106.01540.pdf
[The slide also annotates each model with its attention complexity, ranging from 𝒪(L²) for the vanilla Transformer, through 𝒪(L√L), 𝒪(L log L), 𝒪(LM), 𝒪(L(K + M)), and 𝒪(B²) for the sparse and bucketed variants, down to 𝒪(L) for Linformer, Linear Transformer, Performer, Luna, and H-Transformer-1D]
Models ListOps Text Retrieval Image Pathfinder Path-X Avg
Chance 10.00 50.00 50.00 10.00 50.00 50.00 44.00
Transformer 36.37 64.27 57.46 42.44 71.40 FAIL 54.39
Local Attention 15.82 52.98 53.39 41.46 66.63 FAIL 46.06
Sparse Transformer 17.07 63.58 59.59 44.24 71.71 FAIL 51.24
LongFormer 35.63 62.85 56.89 42.22 69.71 FAIL 53.46
Linformer 35.70 53.94 52.27 38.56 76.34 FAIL 51.36
Reformer 37.27 56.10 53.40 38.07 68.50 FAIL 50.67
Sinkhorn Transformer 33.67 61.20 53.83 41.23 67.45 FAIL 51.39
Synthesizer 36.99 61.68 54.67 41.61 69.45 FAIL 52.88
BigBird 36.05 64.02 59.29 40.83 74.87 FAIL 55.01
Linear Transformer 16.13 65.90 53.09 42.34 75.30 FAIL 50.55
Performer 18.01 65.40 53.82 42.77 77.05 FAIL 51.41
Luna-16 36.96 64.25 78.93 45.41 77.21 FAIL 60.55
Luna-128 37.13 64.38 79.15 47.40 77.67 FAIL 61.15
Luna-256 37.25 64.57 79.29 47.38 77.72 FAIL 61.24
IGLOO 39.23 82.00 75.50 47.00 67.50 NA 62.25
H-Transformer-1D 49.53 78.69 63.99 46.05 68.78 FAIL 61.41
03. Did it actually work? What are the limitations?
Reported Experiments and Results
A thin comparison against Luna... and performance that falls short of IGLOO...
03. Did it actually work? What are the limitations?
Reported Experiments and Results
Benchmark 2: LM trained on 1B words
• The 1-billion-word benchmark
• Transformer baseline: Flax's default transformer decoder
03. Did it actually work? What are the limitations?
Limitations and Future Work
• Title: "Fast One-Dimensional Hierarchical Attention For Sequences"
  • Yet there is no experiment on computational efficiency...
  • There are plenty of algorithms with linear computational complexity by now; show it in the results (as Luna does)
• Ultimately FAILs on the Path-X task
  • We hoped it might show something here, but in the end the outcome matches previous papers
• So what is the inductive bias, in the end?
  • The paper claims the attention matrix should have a hierarchical low-rank structure
  • It declares that the evidence will come from the experimental results
  • But in practice it loses to IGLOO, the public #1 on the leaderboard...
  • So what was it ultimately trying to show?
• According to Phil Wang, cross-attention is not possible
  • Because there is no notion of locality between source and target
  • https://github.com/lucidrains/h-transformer-1d/issues/2
• For a Google "hot paper"... quite a disappointing read
[Figure: the hierarchical block-rank pattern of the attention matrix]
✓ Applies numerical analysis to making the self-attention matrix sparse
✓ Hoping the 2D model comes with an improved experimental setup and speed comparisons...