Transformer-XL: Attentive Language Models
Beyond a Fixed-Length Context
San Kim
2019. 08. 23
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov
Transformer architecture [1]
Worked example: the input sentence is tokenized, mapped to ids, and embedded into a d_emb × seq_len matrix (numeric example values omitted).
Input tokens: [CLS] family isn ' t about whose blood you have . it ' s about who you care about . [SEP] [pad] [pad] [pad]
Input ids: 101 11214 65148 112 162 10935 17060 15465 10855 10574 119 10197 112 161 10935 10488 10855 11258 10935 119 102 0 0 0
Embedding: a d_emb × seq_len matrix, one embedding vector per token.
Transformer architecture [1]
Positional Encoding (sinusoidal or learned positional embedding): a d_emb × seq_len matrix, one column per position.
Word embedding: a d_emb × seq_len matrix, one column per token ([CLS] family isn ' t ... care about .).
Positional Encoding + Word embedding: the element-wise sum of the two matrices, still d_emb × seq_len; this sum is the input to the first attention layer. (Numeric example values omitted.)
Transformer architecture [1]
Sinusoidal positional encoding
• For any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
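A minimal NumPy sketch of the sinusoidal encoding from [1] (function and variable names are mine, not from the slides); even dimensions use sine and odd ones cosine, which is what makes PE_{pos+k} a fixed linear function of PE_{pos}.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_emb):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_emb)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_emb)).
    Assumes d_emb is even."""
    positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    div_terms = np.power(10000.0, np.arange(0, d_emb, 2) / d_emb)  # (d_emb/2,)
    pe = np.zeros((seq_len, d_emb))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe  # each row is the encoding of one position
```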
Transformer architecture [1]
Sinusoidal positional embeddings
• A model trained on a memory of some certain length can automatically generalize to a memory several times longer during evaluation.
Learned positional embeddings
Transformer architecture [1]
Positional Encoding + Word embedding: Emb, a d_emb × seq_len matrix (example values omitted).
W_q: d_q × d_emb projection weights; B_q: 1 × d_q bias (likewise W_k, B_k and W_v, B_v).
Q_proj = Emb_q^T W_q^T + B_q^T
K_proj = Emb_k^T W_k^T + B_k^T
V_proj = Emb_v^T W_v^T + B_v^T
Transformer architecture [1]
Q_proj: d_q × seq_len and K_proj: d_k × seq_len (example values omitted).
attention_scores = Q_proj^T K_proj / sqrt(d_q), with d_q = d_k.
Transformer architecture [1]
att_probs = softmax(att_scores)
Transformer architecture [1]
att_probs: seq_len_q × seq_len_k attention probabilities ([pad] columns receive zero probability); V_proj: d_v × seq_len_v (example values omitted).
context_vectors = att_probs × V_proj^T, with seq_len_k = seq_len_v.
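Putting the last few slides together: a NumPy sketch of single-head scaled dot-product attention (the padding mask argument is my addition, mirroring the zeroed [pad] columns in the example).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, pad_mask=None):
    """Q: (seq_len_q, d), K and V: (seq_len_k, d). Returns context vectors (seq_len_q, d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # attention_scores, (seq_len_q, seq_len_k)
    if pad_mask is not None:                          # True where the key token is [pad]
        scores = np.where(pad_mask[None, :], -1e9, scores)
    scores -= scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)        # att_probs, rows sum to 1
    return probs @ V, probs                           # context_vectors, att_probs
```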
Transformer architecture [1]
Family isn't about whose blood you have. It's about who you care about.
๐‘ฆ๐‘œ๐‘ข ๐‘ž = 0.63 ร— โ„Ž๐‘Ž๐‘ฃ๐‘’ ๐‘ฃ + 0.08 ร— ๐‘๐‘™๐‘œ๐‘œ๐‘‘ ๐‘ฃ + 0.07 ร— ๐‘คโ„Ž๐‘œ๐‘ ๐‘’ ๐‘ฃ
๐‘คโ„Ž๐‘œ ๐‘ž = 0.32 ร— ๐‘ฆ๐‘œ๐‘ข ๐‘ฃ + 0.24 ร— ๐‘Ž๐‘๐‘œ๐‘ข๐‘ก ๐‘ฃ + 0.11 ร— โ„Ž๐‘Ž๐‘ฃ๐‘’ ๐‘ฃ
๐‘ฆ๐‘œ๐‘ข ๐‘ž = 0.34 ร— ๐‘คโ„Ž๐‘œ ๐‘ฃ + 0.27 ร— ๐‘๐‘Ž๐‘Ÿ๐‘’ ๐‘ฃ
Transformer architecture [1]
๐‘๐‘œ๐‘›๐‘ก๐‘’๐‘ฅ๐‘ก_๐‘ฃ๐‘’๐‘๐‘ก๐‘œ๐‘Ÿ๐‘ 
[CLS] family ... [pad] [pad]
-0.19 -0.25 ... 0.2 0.28
0.01 -0.33 ... -0.12 0.11
... ... ... ... ...
0.23 -0.09 ... 0.24 0.33
0.05 0.15 ... -0.26 -0.29
๐‘‘ ๐‘ฃ ร— ๐‘ ๐‘’๐‘ž_๐‘™๐‘’๐‘› ๐‘ž
[CLS] family ... [pad] [pad]
-0.03 -0.21 ... 0.81 -0.04
0.04 1.17 ... 0.36 0.44
... ... ... ... ...
-0.02 0.6 ... -0.18 -0.03
0.01 -0.28 ... -0.28 -0.15
๐‘Ž๐‘ก๐‘ก_๐‘™๐‘Ž๐‘ฆ๐‘’๐‘Ÿ๐‘œ๐‘ข๐‘ก
๐‘’๐‘š๐‘ ๐‘‘ ร— ๐‘ ๐‘’๐‘ž_๐‘™๐‘’๐‘› ๐‘ž
๐‘Ž๐‘ก๐‘ก_๐‘™๐‘Ž๐‘ฆ๐‘’๐‘Ÿ๐‘œ๐‘ข๐‘ก = ๐‘๐‘œ๐‘›๐‘๐‘Ž๐‘ก( ๐‘๐‘œ๐‘›๐‘ก๐‘’๐‘ฅ๐‘ก ๐‘ฃ๐‘’๐‘๐‘ก๐‘œ๐‘Ÿ๐‘  ๐‘– i โˆˆ 0, ๐‘ , ๐‘– < ๐‘›๐‘ข๐‘šโ„Ž๐‘’๐‘Ž๐‘‘
Transformer architecture [1]
๐น๐น๐‘ = max 0, ๐‘ฅ๐‘Š1 + ๐‘1 ๐‘Š2 + ๐‘2
๐‘ฅ
seq_len ๐‘ž ร— ๐‘‘ ๐‘’๐‘š๐‘
๐‘Š1
๐‘‘ ๐‘’๐‘š๐‘ ร— ๐‘‘๐‘–๐‘›๐‘ก๐‘’๐‘Ÿ
๐‘1
1 ร— ๐‘‘๐‘–๐‘›๐‘ก๐‘’๐‘Ÿ
๐‘Š2
๐‘‘๐‘–๐‘›๐‘ก๐‘’๐‘Ÿ ร— ๐‘‘ ๐‘’๐‘š๐‘
๐‘2
1 ร— ๐‘‘ ๐‘’๐‘š๐‘
๐ฟ๐‘Ž๐‘ฆ๐‘’๐‘Ÿ๐‘๐‘œ๐‘Ÿ๐‘š ๐‘ฅ + ๐‘‘๐‘Ÿ๐‘œ๐‘๐‘œ๐‘ข๐‘ก ๐‘ ๐‘ข๐‘๐‘™๐‘Ž๐‘ฆ๐‘’๐‘Ÿ ๐‘ฅ
Transformer architecture [1]
The memory cost of scaled dot-product attention is quadratic w.r.t. the sequence length.
Transformer architecture [1]
Pros
• It enables the learning of long-term dependency.
• It is less affected by the vanishing-gradient problem than an RNN.
Cons
• The model cannot capture any dependency longer than the predefined context length.
• The model lacks the contextual information needed to predict the first few symbols well (context fragmentation).
• Longer sequences are disproportionately expensive because attention is quadratic in the sequence length.
Transformer-XL addresses these issues with a segment-level recurrence mechanism and a novel positional encoding scheme; it not only enables capturing longer-term dependency but also resolves the context fragmentation problem [2].
Vanilla Transformer
Vanilla Transformer with a fixed-length context at training time.
Vanilla Transformer with a fixed-length context at evaluation time.
• Context fragmentation (information never flows across segments).
• Dependency length is upper-bounded by the segment length.
• At evaluation time the whole segment has to be processed from scratch to predict each token.
• This evaluation procedure is extremely expensive.
Transformer XL (extra long) [2]
Transformer-XL with segment-level recurrence at training time.
Transformer-XL with segment-level recurrence at evaluation time.
• It can capture dependencies longer than the segment length.
• It is faster than the vanilla Transformer during evaluation.
Transformer XL (extra long) [2]
Major contributions
โ€ข Segment-Level Recurrence with State Reuse
โ€ข Relative Positional Encodings
Embedding and loss
• Adaptive input representations
• Adaptive softmax
Transformer XL (extra long) [2]
Segment-Level Recurrence with State Reuse
h̃_{τ+1}^{n−1} = [SG(h_τ^{n−1}) ∘ h_{τ+1}^{n−1}],
q_{τ+1}^n, k_{τ+1}^n, v_{τ+1}^n = h_{τ+1}^{n−1} W_q^T, h̃_{τ+1}^{n−1} W_k^T, h̃_{τ+1}^{n−1} W_v^T,
h_{τ+1}^n = Transformer-Layer(q_{τ+1}^n, k_{τ+1}^n, v_{τ+1}^n).

Let the two consecutive segments of length L be s_τ = [x_{τ,1}, ⋯, x_{τ,L}] and s_{τ+1} = [x_{τ+1,1}, ⋯, x_{τ+1,L}], respectively.
Denote the n-th layer hidden state sequence produced for the τ-th segment s_τ by h_τ^n ∈ ℝ^{L×d}, where d is the hidden dimension.
SG(⋅): stop-gradient.
[h_u ∘ h_v]: the concatenation of two hidden sequences along the length dimension.
Transformer XL (extra long) [2]
Segment-Level Recurrence with State Reuse
(Diagram: the cached hidden states h_τ^{n−1} of the previous segment are concatenated with h_{τ+1}^{n−1} to form the extended context h̃_{τ+1}^{n−1}, from which h_{τ+1}^n is computed.)
Transformer XL (extra long) [2]
Absolute positional encodings (sinusoidal):
A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j.

Relative positional encodings:
A_{i,j}^{rel} = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i−j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i−j}.

• U_j → R_{i−j}: the absolute position of the key is replaced by the relative distance i − j.
• U_i^T W_q^T → trainable parameters u ∈ ℝ^d and v ∈ ℝ^d.
• W_k → W_{k,E}, W_{k,R}: separate key projections for the content term E_{x_j} and the positional term R_{i−j}.
Transformer XL (extra long) [2]
h ๐œ
๐‘›โˆ’1
= ๐‘†๐บ ๐‘š ๐œ
๐‘›โˆ’1
โˆ˜ โ„Ž ๐œ
๐‘›โˆ’1
๐‘ž ๐œ
๐‘›
, ๐‘˜ ๐œ
๐‘›
, ๐‘ฃ๐œ
๐‘›
= โ„Ž ๐œ
๐‘›โˆ’1
๐‘Š๐‘ž
๐‘›โˆ’1 ๐‘‡
, โ„Ž ๐œ
๐‘›โˆ’1
๐‘Š๐‘˜,๐ธ
๐‘› ๐‘‡
, โ„Ž ๐œ
๐‘›โˆ’1
๐‘Š๐‘ฃ
๐‘› ๐‘‡
๐ด ๐œ,๐‘–,๐‘—
๐‘›
= ๐‘ž ๐œ,๐‘–
๐‘› ๐‘‡
๐‘˜ ๐œ,๐‘—
๐‘›
+ ๐‘ž ๐œ,๐‘–
๐‘› ๐‘‡
๐‘Š๐‘˜,๐‘…
๐‘›
๐‘…๐‘–โˆ’๐‘— + ๐‘ข ๐‘‡
๐‘˜ ๐œ,๐‘— + ๐‘ฃ ๐‘‡
๐‘Š๐‘˜,๐‘…
๐‘›
๐‘…๐‘–โˆ’๐‘—
๐‘Ž ๐œ
๐‘›
= ๐‘€๐‘Ž๐‘ ๐‘˜๐‘’๐‘‘_๐‘†๐‘œ๐‘“๐‘ก๐‘š๐‘Ž๐‘ฅ ๐ด ๐œ
๐‘›
๐‘ฃ๐œ
๐‘›
๐‘œ๐œ
๐‘›
= ๐ฟ๐‘Ž๐‘ฆ๐‘’๐‘Ÿ๐‘๐‘œ๐‘Ÿ๐‘š ๐ฟ๐‘–๐‘›๐‘’๐‘Ž๐‘Ÿ ๐‘Ž ๐œ
๐‘›
+ โ„Ž ๐œ
๐‘›โˆ’1
โ„Ž ๐œ
๐‘›
= ๐‘ƒ๐‘œ๐‘ ๐‘–๐‘ก๐‘–๐‘œ๐‘›๐‘ค๐‘–๐‘ ๐‘’_๐น๐‘’๐‘’๐‘‘_๐น๐‘œ๐‘Ÿ๐‘ค๐‘Ž๐‘Ÿ๐‘‘ ๐‘œ๐œ
๐‘›
โ„Ž ๐œ
0
โ‰” ๐ธ๐‘  ๐œ
Transformer XL (extra long) [2]
Efficient Computation of the Attention with Relative Positional Embedding
The position-dependent terms E_{x_i}^T W_q^T W_{k,R} R_{i−j} + v^T W_{k,R} R_{i−j} can be computed efficiently by defining Q in reversed order: Q_k = W_{k,R} R_{M+L−1−k}.
Transformer XL (extra long) [2]
Efficient Computation of the Attention with Relative Positional Embedding
E_{x_i}^T W_q^T W_{k,R} R_{i−j} + v^T W_{k,R} R_{i−j}
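A sketch (adapted from the trick described in [2], Appendix B) of the shift that turns scores computed against reversed-order encodings into scores indexed by relative distance; x here is the position-dependent part of the attention matrix for one head.

```python
import torch

def rel_shift(x):
    """x: (qlen, klen); column k was computed against W_{k,R} R_{M+L-1-k} (reversed order).
    After padding one zero column and re-reading the buffer row-major, x[i, j] holds the
    term for relative distance (i + M) - j; out-of-range entries are masked later."""
    qlen, klen = x.shape
    zero_pad = x.new_zeros(qlen, 1)
    x_padded = torch.cat([zero_pad, x], dim=1)        # (qlen, klen + 1)
    x_padded = x_padded.view(klen + 1, qlen)          # same buffer, new shape
    return x_padded[1:].view(qlen, klen)              # drop one row, reshape back
```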
Transformer XL (extra long) [2] Ablation study
Efficient softmax approximation for GPUs [3]
Most of the probability mass is covered by a small fraction of the dictionary, e.g., 87% of the document is covered by only 20% of the vocabulary in the Penn TreeBank.
[Figure: hierarchical softmax, adopted from Hugo Larochelle's YouTube lectures]
Related approximations:
• Hierarchical softmax
• Differentiated softmax
• Importance sampling
• Negative sampling
• Noise sampling
Efficient softmax approximation for GPUs [3]
Notation
• B: batch size
• k = |V|: cardinality of the total vocabulary
• g(k) = max(c + λk₀, c + λk) = c_m + max(0, λ(k − k₀)): computation time
1. The computation time g(k) is constant for low values of k, up to an inflection point k₀ ≈ 50, and becomes affine for values k > k₀.
2. Empirically, c_m = 0.40 ms on a K40 and 0.22 ms on an M40.
Efficient softmax approximation for GPUs [3]
Notation (two-cluster case)
• V_h: the word set of the head
• V_t: the word set of the tail
• k_h = |V_h|, k_t = |V_t|
• p_i: the probability of a word occurring in the set V_i
C = g(k_h + 1, B) + g(k_t, p_t·B)
Efficient softmax approximation for GPUs [3]
General case with J tail clusters:
C_h = g(J + k_h, B)
∀i, C_i = g(k_i, p_i·B)
C = g(J + k_h, B) + Σ_i g(k_i, p_i·B).
Constraint: kB ≥ k₀B₀ (stay in the affine regime), so
C = c + λB(J + k_h) + Σ_i (c + λ k_i p_i B) = (J + 1)c + λB(J + k_h + Σ_i p_i k_i).
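A tiny Python sketch of this cost model; c, λ and k₀B₀ below are illustrative placeholders, not the measured constants from [3].

```python
def g(k, B, c=1.0, lam=1e-4, k0B0=2560):
    """Time of a (B x d) @ (d x k) matmul: flat below the inflection point k*B = k0*B0,
    affine in k*B above it (illustrative constants)."""
    return c + lam * max(k0B0, k * B)

def adaptive_softmax_cost(k_h, tail_sizes, tail_probs, B):
    """C = g(J + k_h, B) + sum_i g(k_i, p_i * B): the head scores every token over
    k_h words plus J cluster symbols; tail cluster i only scores its p_i * B tokens."""
    J = len(tail_sizes)
    return g(J + k_h, B) + sum(g(k_i, p_i * B) for k_i, p_i in zip(tail_sizes, tail_probs))
```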
Efficient softmax approximation for GPUs [3]
๐‘๐‘– ๐‘˜๐‘– + ๐‘๐‘— ๐‘˜๐‘— = ๐‘๐‘– ๐‘˜๐‘– โˆ’ ๐‘˜๐‘— + ๐‘๐‘–+๐‘— ๐‘˜๐‘— where ๐‘๐‘–+๐‘— = ๐‘๐‘– + ๐‘๐‘—
๐ถ = ๐ฝ + 1 ๐‘ + ๐œ†๐ต ๐ฝ + ๐‘˜โ„Ž +
๐‘–
๐‘๐‘– ๐‘˜๐‘–
Assume that ๐‘˜๐‘– > ๐‘˜๐‘—, and fix the quantities ๐‘๐‘–+๐‘—, ๐‘˜๐‘– and ๐‘˜๐‘—.
The best strategy is trivially to minimize the probability of the largest cluster
๐’ฑ๐‘–.
For a fixed number of clusters of given sizes, the best strategy is to assign
the words by decreasing probabilities to cluster of increasing size.
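A short sketch of that assignment rule (my own illustration): sort words by decreasing unigram probability and fill the clusters from smallest to largest.

```python
def assign_clusters(word_probs, cluster_sizes):
    """word_probs: unigram probability per word id; cluster_sizes: the fixed cluster
    sizes (head first). Returns the word ids of each cluster, with the most frequent
    words in the smallest cluster."""
    order = sorted(range(len(word_probs)), key=lambda w: -word_probs[w])
    clusters, start = [], 0
    for size in sorted(cluster_sizes):
        clusters.append(order[start:start + size])
        start += size
    return clusters
```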
Adaptive Input Representations [4]
๐ฟ๐‘–๐‘›๐‘’๐‘Ž๐‘Ÿ๐’ฑ1
๐‘‚๐‘ข๐‘ก๐‘๐‘ข๐‘ก ๐‘™๐‘Ž๐‘ฆ๐‘’๐‘Ÿ๐’ฑ1
๐ฟ๐‘–๐‘›๐‘’๐‘Ž๐‘Ÿ๐’ฑ2
๐ฟ๐‘–๐‘›๐‘’๐‘Ž๐‘Ÿ๐’ฑ ๐‘›
โ‹ฏ
๐‘‚๐‘ข๐‘ก๐‘๐‘ข๐‘ก ๐‘™๐‘Ž๐‘ฆ๐‘’๐‘Ÿ๐’ฑ2
๐‘‚๐‘ข๐‘ก๐‘๐‘ข๐‘ก ๐‘™๐‘Ž๐‘ฆ๐‘’๐‘Ÿ๐’ฑ ๐‘›โ‹ฏ
๐‘‘
๐‘‘ โ†’ ๐‘‘ + ๐‘› โˆ’ 1 ๐‘‘ โ†’
๐‘‘
๐‘˜1
๐‘‘ โ†’
๐‘‘
๐‘˜ ๐‘›โˆ’1
๐’ฑ1 + ๐‘› โˆ’ 1 ๐’ฑ2
๐’ฑ๐‘›
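A condensed PyTorch sketch of adaptive input embeddings [4]; the cutoffs and the reduction factor k = 4 are common defaults, but treat the whole module as an illustration rather than the reference implementation (weight tying with the adaptive softmax output layers is omitted).

```python
import torch
import torch.nn as nn

class AdaptiveInput(nn.Module):
    """Frequency cluster i gets embeddings of dimension d // k**i plus a linear
    projection back to the model dimension d."""
    def __init__(self, cutoffs, d, k=4):               # e.g. cutoffs = [20000, 60000, vocab_size]
        super().__init__()
        self.d = d
        self.cutoffs = [0] + list(cutoffs)
        self.embeds = nn.ModuleList()
        self.projs = nn.ModuleList()
        for i in range(len(cutoffs)):
            size = self.cutoffs[i + 1] - self.cutoffs[i]
            dim = d // (k ** i)
            self.embeds.append(nn.Embedding(size, dim))
            self.projs.append(nn.Linear(dim, d, bias=False))

    def forward(self, tokens):                          # tokens: (batch, seq_len) of word ids
        out = torch.zeros(*tokens.shape, self.d, device=tokens.device)
        for i, (emb, proj) in enumerate(zip(self.embeds, self.projs)):
            lo, hi = self.cutoffs[i], self.cutoffs[i + 1]
            mask = (tokens >= lo) & (tokens < hi)       # words assumed sorted by frequency,
            if mask.any():                              # so low ids fall in the head cluster
                out[mask] = proj(emb(tokens[mask] - lo))
        return out
```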
References
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[2] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860, 2019.
[3] Edouard Grave, Armand Joulin, Moustapha Cisse, David Grangier, and Herve Jegou. Efficient softmax approximation for GPUs. CoRR, abs/1609.04309, 2016.
[4] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. CoRR, abs/1809.10853, 2018.