The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Attention Is All You Need
1. Attention Is All You Need
Presenter: Illia Polosukhin, NEAR.ai
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Work performed while at Google
2. ● RNNs have transformed NLP
● State-of-the-art across many tasks
● Translation has been a recent example of a large win
Deep Learning for NLP
https://research.googleblog.com/2017/04/introducing-tf-seq2seq-open-source.html
4. ● Hard to parallelize efficiently
● Backpropagation through the sequence
● Transmitting local and global information through one bottleneck [hidden state]
Problem with RNNs
5. ● Trying to solve the problems with sequence models
● Notable work:
○ Neural GPU
○ ByteNet
○ ConvS2S
● Limited by the size of the convolution
Convolutional Models
Neural Machine Translation in Linear Time, Kalchbrenner et al.
6. ● Removes the bottleneck of the encoder-decoder model
● Provides context for a given decoder step (see the sketch after this slide)
Attention Mechanics
“Neural Machine Translation by Jointly Learning to Align and Translate”, Bahdanau et al.
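A minimal NumPy sketch of what such an attention layer computes for one decoder step, assuming the encoder states and the current decoder state are already given as arrays; dot-product scoring is used here for brevity, whereas Bahdanau et al. use an additive (MLP) score, and the function name is illustrative:

    import numpy as np

    def attention_context(decoder_state, encoder_states):
        """decoder_state: (d,), encoder_states: (n, d) -> context vector of shape (d,)."""
        # Score every encoder state against the current decoder state.
        scores = encoder_states @ decoder_state
        # Softmax over source positions -> attention weights.
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()
        # The context is a weighted sum of encoder states: every source position
        # can contribute directly, so there is no single fixed-size bottleneck.
        return weights @ encoder_states

The key point of the slide: the decoder no longer has to read the whole source sentence out of one hidden vector; it recomputes a fresh context at every step.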
7. ● “Inner Attention based Recurrent Neural Networks for Answer Selection”, ACL 2016, Wang et al.
● “Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention”, 2016, Liu et al.
● “Long Short-Term Memory-Networks for Machine Reading”, EMNLP 2016, Cheng et al.
● “A Decomposable Attention Model for Natural Language Inference”, EMNLP 2016, Parikh et al.
Self/Intra/Inner Attention in Literature
13. ● Positional encoding provides the relative or absolute position of a given token
● Many options for choosing the positional encoding; in this work, sinusoids are used so that for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos) (see the sketch below)
● Alternative: learn positional embeddings
Positional Encoding
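For reference, the fixed encoding in the paper uses sinusoids of geometrically spaced wavelengths: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch (the function name and shapes are illustrative, not from the authors' code):

    import numpy as np

    def sinusoidal_positional_encoding(max_len, d_model):
        """Returns a (max_len, d_model) matrix of fixed sinusoidal encodings (d_model even)."""
        pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
        two_i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 0, 2, 4, ...
        angles = pos / np.power(10000.0, two_i / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
        pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
        return pe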
Multiplicative interaction [Hinton]: attention has all of that, with long-range connections and no bottleneck.
Self-attention is a learned pooling with multiplicative interaction.
In all of these works except “A Decomposable Attention Model”, self-attention is used in conjunction with an RNN.
The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the values. The scaling by 1/√dk is there because for large dk the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients.
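A minimal NumPy sketch of this scaled dot-product attention, assuming queries, keys, and values are already projected matrices (shapes and the function name are illustrative):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
        d_k = Q.shape[-1]
        # Dot products of every query with every key, scaled by sqrt(d_k)
        # so large dimensions do not push the softmax into the small-gradient regime.
        scores = Q @ K.T / np.sqrt(d_k)
        # Softmax over the keys gives the weights placed on the values.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ V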
Combines pieces from different parts of the representation subspace.
Multiple attention distributions can focus on different positions, and each head operates on a smaller encoding to keep the computational cost down.
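A rough sketch of multi-head self-attention under the same assumptions, reusing scaled_dot_product_attention from the sketch above; the random matrices stand in for learned projection weights, and the name and defaults are illustrative:

    import numpy as np

    def multi_head_attention(X, num_heads=8, d_model=512, seed=0):
        """Self-attention over X of shape (n, d_model), split into num_heads smaller heads."""
        rng = np.random.default_rng(seed)
        d_head = d_model // num_heads
        heads = []
        for _ in range(num_heads):
            # Random stand-ins for the learned per-head projections W_Q, W_K, W_V.
            W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
            heads.append(scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v))
        # Concatenate the per-head outputs and apply a final learned projection W_O.
        W_o = rng.standard_normal((num_heads * d_head, d_model))
        return np.concatenate(heads, axis=-1) @ W_o

Because each head attends in a d_model/num_heads-dimensional subspace, the total cost is similar to single-head attention with full dimensionality.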
Learned positional embeddings work about as well as the fixed sinusoids.
An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads.