This slide deck introduces Transformer-XL, the base paper for XLNet. It explains the major contribution of the paper and also reviews the original Transformer in order to compare the differences between the Transformer and Transformer-XL. Happy NLP!
Transformer-XL
1. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
San Kim
2019. 08. 23
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov
6. Transformer architecture [1]
Sinusoidal positional encoding
• For any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
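The linearity claim can be checked numerically. A minimal NumPy sketch (illustrative, not from the slides): for each sine/cosine pair, $PE_{pos+k}$ is obtained from $PE_{pos}$ by a 2x2 rotation whose angle depends only on the offset $k$, not on $pos$.

import numpy as np

def sinusoidal_pe(positions, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pe = np.zeros((len(positions), d_model))
    div = 10000.0 ** (np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(np.asarray(positions)[:, None] / div)
    pe[:, 1::2] = np.cos(np.asarray(positions)[:, None] / div)
    return pe

d_model, k = 8, 5
pe = sinusoidal_pe(range(100), d_model)
freqs = 1.0 / (10000.0 ** (np.arange(0, d_model, 2) / d_model))

rotated = np.empty_like(pe[:-k])
for i, w in enumerate(freqs):
    c, s = np.cos(k * w), np.sin(k * w)
    R = np.array([[c, s], [-s, c]])            # rotation by k*w, independent of pos
    rotated[:, 2*i:2*i+2] = pe[:-k, 2*i:2*i+2] @ R.T
print(np.allclose(rotated, pe[k:]))            # True: PE(pos+k) is a linear map of PE(pos)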
7. Transformer architecture [1]
Sinusoidal positional embeddings
A model trained on a memory of a certain length can automatically generalize to a memory several times longer during evaluation.
Learned positional embeddings, in contrast, are tied to the positions seen during training and cannot generalize beyond them.
17. Transformer architecture [1]
Pros
• It enables the learning of long-term dependency.
• It is less affected by the vanishing-gradient problem than an RNN.
Cons
• The model cannot capture any dependency longer than the predefined context length.
• The model lacks the contextual information needed to predict the first few symbols of a segment well (context fragmentation).
• Longer sequences are disproportionately expensive because attention cost is quadratic in the sequence length.
Transformer-XL addresses these issues with a segment-level recurrence mechanism and a novel positional encoding scheme. This method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem.
18. Vanilla Transformer
Vanilla Transformer with a fixed-length context at training time:
• Context fragmentation (information never flows across segments).
• Dependency length is upper-bounded by the segment length.
Vanilla Transformer with a fixed-length context at evaluation time:
• The whole segment has to be processed from scratch to predict each token.
• This evaluation procedure is extremely expensive (see the sketch below).
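A minimal sketch of this evaluation procedure (illustrative Python; model() is a hypothetical stand-in for a trained fixed-context Transformer LM, not the vanilla model's actual API): the window slides one position per predicted token and the whole segment is re-encoded from scratch every time.

def vanilla_evaluate(model, tokens, context_len):
    log_probs = []
    for t in range(len(tokens)):
        # take at most context_len tokens of left context for this prediction
        segment = tokens[max(0, t - context_len + 1): t + 1]
        # the full segment is re-processed for every single prediction,
        # so evaluating N tokens costs on the order of N * L^2 attention operations
        logits = model(segment)
        log_probs.append(logits[-1])   # only the last position's prediction is used
    return log_probs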
19. Transformer-XL (extra long) [2]
Transformer-XL with segment-level recurrence at training time.
Transformer-XL with segment-level recurrence at evaluation time.
• It can capture dependency longer than the segment length.
• It is much faster than the vanilla Transformer during evaluation (a sketch of the recurrence follows).
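A rough sketch of the segment-level recurrence (NumPy, illustrative shapes only; this is not the authors' implementation): hidden states cached from the previous segment are reused as extra context for the current segment, with no gradient flowing into the cache.

import numpy as np

def xl_step(segment_hidden, mem, mem_len):
    # attention context for the current segment = [cached memory ; current states];
    # afterwards the memory is rolled forward and truncated to mem_len entries
    context = np.concatenate([mem, segment_hidden], axis=0)
    new_mem = context[-mem_len:]
    return context, new_mem

d, seg_len, mem_len = 4, 3, 6
mem = np.zeros((0, d))                         # empty memory before the first segment
for step in range(3):
    h = np.random.randn(seg_len, d)            # stand-in for this segment's hidden states
    context, mem = xl_step(h, mem, mem_len)
    print(step, context.shape, mem.shape)      # context grows up to seg_len + mem_len

Because earlier segments live on only as cached states, each token is processed once, which is where the evaluation speed-up over the vanilla sliding window comes from.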
20. Transformer-XL (extra long) [2]
Major contributions
• Segment-level recurrence with state reuse
• Relative positional encodings (see the score decomposition below)
Embedding and loss
• Adaptive input representations [4]
• Adaptive softmax [3]
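For reference, the relative attention score from the Transformer-XL paper [2] decomposes into four terms: (a) content-based addressing, (b) content-dependent positional bias, (c) a global content bias, and (d) a global positional bias:

$A^{rel}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}$

where $R_{i-j}$ is a sinusoidal encoding of the relative distance $i-j$ and $u$, $v$ are learned vectors that replace the absolute-position query terms.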
37. Efficient softmax approximation for GPUs [3]
Most of the probability mass is covered by a small fraction of the dictionary, e.g., 87% of the document is covered by only 20% of the vocabulary in the Penn Treebank.
→ Hierarchical softmax [figure adapted from Hugo Larochelle's YouTube lectures]
• Hierarchical softmax
• Differentiated softmax
• Importance sampling
• Negative sampling
• Noise sampling
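As background (the standard two-level factorization, not specific to [3]): a hierarchical softmax replaces the flat distribution over the vocabulary with a product of a cluster distribution and a within-cluster distribution,

$p(w \mid h) = p(C(w) \mid h) \cdot p(w \mid C(w),\ h)$

so with roughly balanced clusters, scoring a target word touches on the order of $\sqrt{|\mathcal{V}|}$ outputs instead of $|\mathcal{V}|$.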
38. Efficient softmax approximation for GPUs [3]
Notation
• $B$: batch size
• $k = |\mathcal{V}|$: cardinality of the total vocabulary
• $g(k) = \max(c + \lambda k_0,\ c + \lambda k) = c_m + \max(0,\ \lambda(k - k_0))$: computation time (a toy sketch of this model follows below)
1. The computation time $g(k)$ is constant for low values of $k$, until a certain inflection point $k_0 \approx 50$, and then becomes affine for values $k > k_0$.
2. Empirically, $c_m = 0.40$ ms on a K40 and $0.22$ ms on an M40.
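A toy sketch of this time model (Python; the constants below are illustrative placeholders, not the paper's fitted values):

def g(k, c=0.02, lam=0.0008, k0=50):
    # piecewise model: roughly constant until the inflection point k0,
    # then affine in k (the GPU is under-utilised for small output sizes)
    return max(c + lam * k0, c + lam * k)

print(g(10) == g(50))   # True: constant regime below k0
print(g(200) > g(50))   # True: affine regime above k0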
39. Efficient softmax approximation for GPUs [3]
Notation
• $\mathcal{V}_h$: the word set of the head
• $\mathcal{V}_t$: the word set of the tail
• $k_h = |\mathcal{V}_h|$, $k_t = |\mathcal{V}_t|$
• $p_i$: the probability of a word occurring in the set $\mathcal{V}_i$
$C = g(k_h + 1,\ B) + g(k_t,\ p_t B)$
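A worked example of this two-cluster cost (Python sketch; the batch-aware g and all constants are illustrative, and the split mirrors the ~20% head / ~13% tail-probability figure quoted earlier):

def g(k, B, c=0.02, lam=1e-6, k0=50):
    # illustrative batch-aware time model for a (B x d) @ (d x k) product
    return max(c + lam * k0 * B, c + lam * k * B)

def two_cluster_cost(k_head, k_tail, p_tail, B):
    # the head classifier scores its k_head words plus one "tail cluster" token;
    # the tail classifier only runs on the p_tail fraction of tokens
    return g(k_head + 1, B) + g(k_tail, p_tail * B)

B = 4096
print(two_cluster_cost(20_000, 80_000, 0.13, B))   # head/tail split
print(g(100_000, B))                               # flat softmax over the full vocabulary, larger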
41. Efficient softmax approximation for GPUs [3]
$p_i k_i + p_j k_j = p_i (k_i - k_j) + p_{i+j} k_j$, where $p_{i+j} = p_i + p_j$
$C = (J + 1)\,c + \lambda B \left[ J + k_h + \sum_i p_i k_i \right]$
Assume that $k_i > k_j$, and fix the quantities $p_{i+j}$, $k_i$ and $k_j$. Since the coefficient $k_i - k_j$ is positive, the cost decreases as $p_i$ decreases, so the best strategy is trivially to minimize the probability of the largest cluster $\mathcal{V}_i$.
For a fixed number of clusters of given sizes, the best strategy is to assign the words by decreasing probability to clusters of increasing size.
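A small sketch of that assignment rule (Python; the word counts and cluster sizes are made-up toy values):

def assign_clusters(word_counts, cluster_sizes):
    # sort the vocabulary by decreasing frequency and fill clusters of
    # increasing size, so frequent words land in the small, cheap head cluster
    assert sum(cluster_sizes) == len(word_counts)
    words = sorted(word_counts, key=word_counts.get, reverse=True)
    clusters, start = [], 0
    for size in sorted(cluster_sizes):
        clusters.append(words[start:start + size])
        start += size
    return clusters

counts = {"the": 900, "of": 700, "cat": 40, "sat": 30, "zygote": 2, "quark": 1}
head, tail = assign_clusters(counts, [2, 4])
print(head)   # ['the', 'of']  -> small, high-probability head cluster
print(tail)   # the long tail of rare words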
47. References
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[2] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860, 2019.
[3] Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. Efficient softmax approximation for GPUs. CoRR, abs/1609.04309, 2016.
[4] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. CoRR, abs/1809.10853, 2018.