BERT
Bidirectional Encoder Representations from Transformers
Multimodal Dialogue Systems Seminar
Abdallah Bashir
CS UdS
19 Sep 2019
Saarbrücken, Germany
1
2
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
3
Keywords!
● Transformers
● Contextual Embeddings
● Bidirectional Training
● Fine-Tuning
● SOTA, a lot of SOTA!
4
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Introduction
● 2018 was a turning point for Natural Language Processing:
○ BERT
○ OpenAI GPT
○ ELMo
○ ULMFiT
[1]
5
Introduction
● BERT [2] (Bidirectional Encoder Representations from Transformers) is an open-source language representation model developed by researchers at Google AI.
● At the time, the paper presented SOTA results on eleven NLP tasks.
6
(Models that outperformed BERT are mentioned at the end.)
Main Contributions
The paper summarizes its contributions in three main points:
● Demonstrating the importance of bidirectional pre-training and why it is better than a unidirectional approach.
● Showing that, with fine-tuning, BERT outperforms task-specific architectures.
● Showing the performance gains where BERT advances the SOTA on eleven NLP tasks.
7
8
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Related Work
● The paper briefly goes through previous approaches to pre-training general language representations, grouped into three main categories:
○ Unsupervised Feature-based Approaches
○ Unsupervised Fine-tuning Approaches
○ Transfer Learning from Supervised Data
9
Unsupervised Feature-based Approaches
● Approaches to learning representations from unlabeled text, i.e. word embeddings, include both non-neural and neural methods.
● Using pre-trained word embeddings instead of training them from scratch has yielded significant improvements in performance.
10
11
The GloVe word embedding of the word "stick" - a vector of 200 floats (rounded to two decimals). It goes on
for two hundred values. [1]
Non Contextual Embeddings
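Not from the slides: a minimal sketch of what "non-contextual" means in practice, assuming a locally downloaded glove.6B.200d.txt file (one word followed by 200 floats per line); the point is that "stick" maps to the same vector in every sentence it appears in.

```python
# Minimal sketch: loading static (non-contextual) GloVe vectors from a text file.
# Assumes glove.6B.200d.txt is available locally; the file name is illustrative.
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.200d.txt")
print(glove["stick"].shape)  # (200,) -- the same vector regardless of the sentence "stick" appears in
```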
โ— Unidirectional
โ—‹ Feature-Based(ELMo)
โ—‹ Fine-tuning(OpenAI GPT).
โ— Bidirectional
โ—‹ BERT
Contextual Embeddings
12
ELMo: what about sentences' context?
13
14
[16]
15
โ— When ELMo contextual representations was used
in task-specific architecture, ELMo advanced the
SOTA benchmarks.
16
17
Transformer
Unsupervised Fine-tuning
Approaches
18
19
20
21
22
23
Self-Attention in a Nutshell!
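The figures behind this slide come from [30]; as a framework-free companion, the sketch below computes single-head scaled dot-product self-attention in NumPy with random projection weights (illustrative only, not the paper's implementation).

```python
# Sketch of single-head scaled dot-product self-attention.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) token vectors; returns one context vector per token."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # relevance of every token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the sequence
    return weights @ v                                 # each output mixes information from all positions

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                           # 5 tokens, d_model = 16
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)          # (5, 16)
```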
Unsupervised Fine-tuning Approaches
● A perk of such approaches is that only a few parameters need to be trained from scratch.
● OpenAI GPT [4] previously achieved SOTA on many sentence-level tasks in the GLUE benchmark [19].
24
Transfer Learning from
Supervised Data
โ— Transfer learning is a means to extract knowledge
from a source setting and apply it to a different
target setting.
โ— Researchers in the field of Computer Vision have
used transfer learning continuously where they
fine-tune models pre-trained with ImageNet. 25
26
27
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
● BERT implementation:
○ Pretraining
○ Fine-tuning
BERT | The Model
28
[16]
29
โ— BERT almost have the same model architecture
across different tasks with small changes between
the pre-trained architecture and the final
downstream architecture
30
Model Architecture
● The BERT architecture consists of a multi-layer bidirectional Transformer encoder [5].
● The BERT model in the paper comes in two sizes:
BERTBASE: Layers = 12, Hidden size = 768, Self-attention heads = 12, Total parameters = 110M
BERTLARGE: Layers = 24, Hidden size = 1024, Self-attention heads = 16, Total parameters = 340M
31
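Expressed as a configuration sketch (assuming the Hugging Face transformers library, which is not part of the paper but reproduces these sizes):

```python
# Sketch: the two published BERT sizes written as Hugging Face configurations.
from transformers import BertConfig

bert_base = BertConfig(num_hidden_layers=12, hidden_size=768,
                       num_attention_heads=12, intermediate_size=3072)   # ~110M parameters
bert_large = BertConfig(num_hidden_layers=24, hidden_size=1024,
                        num_attention_heads=16, intermediate_size=4096)  # ~340M parameters
```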
โ— BERTbase was chosen to have the same size as
OpenAI GPT for benchmarks purposes.
โ— The difference between the two models is BERT
train the Language Model transformers in both
directions while the OpenAI GPT context trains it
only left-to-right.
Comparisons with OpenAI GPT
32
Model Input/Output Representation
● The paper defines a 'sentence' as any sequence of text. As mentioned before, it can be a single sentence or two sentences packed together.
● BERT's input can represent either a single sentence or a pair of sentences (as in Question Answering).
● BERT uses WordPiece embeddings [23] with a 30,000-token vocabulary. 33
[24]
34
โ— The first token in each sequence is always a
[CLS] token that is used in classification.
โ— Separating Sentences apart can be done in two
steps:
โ—‹ Separate between sentences using [SEP] token,
โ—‹ Adding a learned embedding to each token in order to
state which sentence it belongs to.
Model Input/Output
Representation
35
โ— Each token is represented by summing the
corresponding token, segment and position
embedding as seen in the above figure
36
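A rough sketch of this input pipeline, assuming the Hugging Face transformers tokenizer (not used in the paper itself); the final comment notes where the three embeddings are summed inside the model.

```python
# Sketch of BERT's input formatting for a sentence pair using a WordPiece tokenizer.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")   # WordPiece vocabulary of ~30k tokens
enc = tok("Where is the answer?", "The answer is in this passage.")

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'where', 'is', ..., '[SEP]', 'the', 'answer', ..., '[SEP]']
print(enc["token_type_ids"])                               # segment ids: 0 for sentence A, 1 for sentence B
# Inside the model, each position's input vector is
#   token_embedding + segment_embedding + position_embedding.
```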
37
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Pre-training BERT
● To pre-train the BERT Transformer bidirectionally, the paper proposes two unsupervised tasks: Masked LM and Next Sentence Prediction (NSP).
38
Pre-training Data
● One of the reasons for BERT's success is the amount of data it was trained on.
● The two main corpora used to pretrain the language model are:
○ the BooksCorpus (800M words)
○ English Wikipedia (2,500M words)
39
Masked Language Modeling
● Usually, language models can only be trained in one specific direction.
● BERT handles this issue by using a "Masked Language Model" objective from earlier literature (Cloze [24]).
● BERT does this by randomly masking 15% of the WordPiece input tokens. Only the masked tokens are predicted later.
40
[16]
41
โ— [MASK] tokens would appear only in the pretraining
and not during the fine-tuning!
โ— To alleviate the masking, after choosing the randomly
15% tokens, BERT either:
โ—‹ Switch the token to [MASK] (80% of the time).
โ—‹ Switch the token to a random other token (10% of the time).
โ—‹ Leave the token unchanged (10% of the time).
โ— The original token is then predicted with cross entropy
loss.
Masked Language Modeling
42
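A minimal sketch of this corruption procedure over a WordPiece token list (plain Python; the percentages follow the paper, the toy vocabulary is illustrative):

```python
# Sketch of BERT's MLM corruption: choose ~15% of tokens, then 80% -> [MASK],
# 10% -> a random token, 10% -> left unchanged; only chosen positions are predicted.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)   # labels stay None for unselected positions
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = tok                                    # the original token is the prediction target
        r = random.random()
        if r < 0.8:
            inputs[i] = "[MASK]"
        elif r < 0.9:
            inputs[i] = random.choice(vocab)               # random replacement
        # else: keep the original token in place
    return inputs, labels

vocab = ["my", "dog", "is", "hairy", "cute", "the"]
print(mask_tokens(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"], vocab))
```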
Next Sentence Prediction (NSP)
● In the pretraining phase, the model receives pairs of sentences and is trained to predict whether the second sentence actually follows the first.
● In the training data, 50% of the inputs are actual pairs, labeled [IsNext], while the other 50% use a random sentence from the corpus as the successor to the first sentence, labeled [NotNext].
● NSP is crucial for downstream tasks such as Question Answering and Natural Language Inference. 43
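A sketch of how such pairs could be assembled from an ordered list of sentences (illustrative only; the real pipeline samples document-level text from the BooksCorpus and Wikipedia):

```python
# Sketch of NSP example construction: 50% true next sentences ([IsNext]),
# 50% random sentences from the corpus ([NotNext]).
import random

def make_nsp_examples(sentences):
    examples = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            examples.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            distractors = [s for j, s in enumerate(sentences) if j != i + 1]
            examples.append((sentences[i], random.choice(distractors), "NotNext"))
    return examples

corpus = ["the man went to the store .",
          "he bought a gallon of milk .",
          "penguins are flightless birds ."]
for a, b, label in make_nsp_examples(corpus):
    print(f"[CLS] {a} [SEP] {b} [SEP] -> {label}")
```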
Next Sentence Prediction (NSP)
44
โ— this stage in total is considered to be inexpensive
relative to the pretraining phase.
โ— For classification tasks (e.g sentiment analysis) we
add a classification FFN for the [CLS] token input
representation on top of the final output.
โ— For Question Answering alike tasks Bert train two
extra vectors that are responsible for marking the
beginning and the end of the answer.
Fine-tuning BERT
45
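A rough PyTorch-style sketch of the classification case (assuming a pre-trained Hugging Face BertModel; only the final linear layer is new). For QA-style tasks the analogous extra parameters are a start vector S and an end vector E, with the probability that token i starts the answer given by a softmax of S · T_i over all token vectors T_i.

```python
# Sketch of a fine-tuning classification head: one linear layer over the final
# hidden state of the [CLS] token (assumes Hugging Face transformers and PyTorch).
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)   # the only new parameters

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls_vec = out.last_hidden_state[:, 0]       # final hidden vector of [CLS]
        return self.head(cls_vec)                   # logits, trained with cross-entropy
```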
Classification Example
[16]
46
47
Outline
โ— Introduction
โ— Related Works
โ— BERT | The Model
โ— Pre-training BERT
โ— Experiments
โ— Ablation Studies
Experiments
48
49
50
Experiments
● GLUE
○ General Language Understanding Evaluation benchmark
● SQuAD v1.1
○ Stanford Question Answering Dataset
● SQuAD v2.0
○ extends SQuAD 1.1
● SWAG
○ Situations With Adversarial Generations (SWAG) dataset
1. GLUE
● "The General Language Understanding Evaluation benchmark (GLUE) [19] is a collection of various natural language understanding tasks."
● For fine-tuning, the same classification approach as in pre-training is used.
51
Table 1: GLUE Test results [2]
52
2. SQuAD v1.1
● "SQuAD v1.1 [26], the Stanford Question Answering Dataset, is a collection of 100k crowdsourced question/answer pairs."
● The input is represented as a single sequence containing both the question and the passage containing the answer.
53
Table 2: SQuAD 1.1 results
2. SQuAD v1.1
54
โ— The SQuAD 2.0 task extends the SQuAD 1.1 by:
โ—‹ making sure that no short answer exists in the
provided paragraph.
โ—‹ marking questions that do not have an answer
with [CLS] token at the start and the end. The
TriviaQA data was used for this model.
3. SQuAD v2.0
55
Table 3: SQuAD 2.0 results
56
● "The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference [28].
● Given a sentence, the task is to choose the most plausible continuation among four choices."
● The paper fine-tunes on the SWAG dataset by creating four input sequences, each containing the given sentence concatenated with one possible continuation (sketched below).
4. SWAG
57
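A sketch of that setup (illustrative; encode_cls is a hypothetical helper returning BERT's [CLS] vector for one concatenated sequence, and the task-specific parameter is a single scoring vector, as in the paper):

```python
# Sketch of SWAG-style multiple-choice scoring: four "sentence + continuation" sequences,
# each scored by a dot product between its [CLS] vector and one learned task vector.
import torch

def score_choices(sentence, choices, encode_cls, task_vector):
    """encode_cls(sentence, choice) -> [CLS] hidden vector (hypothetical helper)."""
    cls_vectors = torch.stack([encode_cls(sentence, c) for c in choices])   # (4, hidden_size)
    logits = cls_vectors @ task_vector                                      # one score per choice
    return torch.softmax(logits, dim=0)                                     # probability of each continuation
```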
Table 4: SWAG results
58
59
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Ablation Studies
60
1. Effect of Pre-training Tasks
● The next experiments showcase the importance of BERT's bidirectionality.
● The same pretraining data, fine-tuning scheme, and hyperparameters as BERTBASE are used throughout the experiments.
61
Experiments and their descriptions:
● No NSP: a bidirectional model trained using only the Masked Language Model (MLM), without the Next Sentence Prediction (NSP) task.
● LTR & No NSP: a standard Left-to-Right (LTR) language model trained without the MLM and the NSP. This model can be compared to OpenAI GPT.
1. Effect of Pre-training Tasks
62
Table 5: Ablation over the pre-training tasks using the BERTBASE architecture [2]
63
โ— The effect of Bert model size on fine-tuning tasks
was tested with different number of layers, hidden
units, and attention heads while using the same
hyperparameters.
โ— Results from fine-tuning on GLUE are shown in
Table 6 which include the average Dev Set
accuracy.
โ— It is clear that the larger the model, the better the
accuracy.
2. Effect of Model Size
64
Table 6: Ablation over BERT model size [2]
65
โ— Bert shows for the first time that increasing the
model can improve the results even for
downstream tasks that its data is very small
because it benefit from the larger, more expressive
pre-trained representations.
66
3. Feature-based Approach with BERT
● There are ways to use BERT for downstream tasks other than fine-tuning: the contextualized word embeddings generated by pre-trained BERT can be used as fixed features in other models.
● Advantages of such an approach:
○ Some tasks require a task-specific model architecture and cannot be modeled with a Transformer encoder architecture. 67
โ— The two approaches were compared by applying Bert to
the CoNLL-2003 Named Entity Recognition (NER) task.
โ— These contextual embeddings are used as input to a
randomly initialized two-layer 768-dimensional BiLSTM
before the classification layer.
โ— The paper tried different approaches to represent the
embeddings (all shown in Table 7.) in the feature-based
approach. Concatenate the last four-hidden layers
output achieved the best.
3. Feature-based Approach with
BERT
68
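A sketch of the feature-extraction side, assuming the Hugging Face transformers library (the downstream two-layer BiLSTM tagger is omitted):

```python
# Sketch of the feature-based approach: run BERT once with all hidden states exposed,
# then concatenate the last four layers per token as fixed features for a downstream tagger.
import torch
from transformers import BertTokenizer, BertModel

tok = BertTokenizer.from_pretrained("bert-base-cased")            # cased model, as is usual for NER
bert = BertModel.from_pretrained("bert-base-cased", output_hidden_states=True)
bert.eval()

enc = tok("John lives in Berlin", return_tensors="pt")
with torch.no_grad():
    hidden_states = bert(**enc).hidden_states                     # tuple: embedding layer + each encoder layer

features = torch.cat(hidden_states[-4:], dim=-1)                  # (1, seq_len, 4 * 768) frozen token features
# These would feed the randomly initialized two-layer 768-dim BiLSTM + classification layer,
# with BERT's own weights kept frozen.
```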
Table 7: CoNLL-2003 Named Entity Recognition results [1]
69
In Conclusion
● BERT's major contribution was adding more generalization to existing transfer learning methods through a bidirectional architecture.
● The BERT model also contributed more broadly to the field: fine-tuning BERT can now tackle a large number of NLP tasks.
70
71
Papers that outperformed BERT
● RoBERTa, FAIR
● XLNet, CMU + Google AI
● CTRL, Salesforce Research
72
References
1. http://jalammar.github.io/illustrated-bert/
2. J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pretraining of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
3. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In NAACL.
4. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving
language understanding with unsupervised learning. Technical report, OpenAI.
5. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural
Information Processing Systems, pages 6000–6010.
6. Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai.
1992. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479.
7. Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from
multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853.
73
8. John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation
with structural correspondence learning. In Proceedings of the 2006 conference on empirical
methods in natural language processing, pages 120–128. Association for Computational
Linguistics.
9. Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp
Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in
statistical language modeling. arXiv preprint arXiv:1312.3005.
10. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove:
Global vectors for word representation. In Empirical Methods in Natural Language
Processing (EMNLP), pages 1532–1543.
11. Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A
simple and general method for semi-supervised learning. In Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics, ACL '10, pages 384–394.
12. Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun,
Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural
information processing systems, pages 3294–3302.
13. Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for
learning sentence representations. In International Conference on Learning Representations.
74
14. Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and
documents. In International Conference on Machine Learning, pages 1188–1196.
15. Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based
objectives for fast unsupervised sentence representation learning. CoRR, abs/1705.00557.
16. http://jalammar.github.io/illustrated-bert/
17. Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances
in neural information processing systems, pages 3079–3087.
18. Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for
text classification. In ACL. Association for Computational Linguistics.
19. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel
Bowman. 2018a. Glue: A multi-task benchmark and analysis platform for natural language
understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and
Interpreting Neural Networks for NLP, pages 353–355.
20. Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes.
2017. Supervised learning of universal sentence representations from natural language inference
data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,
pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
75
21. Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in
translation: Contextualized word vectors. In NIPS.
22. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale
Hierarchical Image Database. In CVPR09.
23. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation
system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
24. Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism
Bulletin, 30(4):415–433.
25. Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba,
and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching
movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.
26. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+
questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing, pages 2383–2392.
27. Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale
distantly supervised challenge dataset for reading comprehension. In ACL.
76
28. Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. Swag: A large-scale
adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing (EMNLP).
29. Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003
shared task: Language-independent named entity recognition. In CoNLL.
30. http://jalammar.github.io/illustrated-transformer/
77

Editor's Notes

  1. Transformers: a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease.
  2. Models that outperformed BERT are mentioned at the end.
  3. ELMo [3] extracts context features by using a bi-directional LSTM language model. Each token is represented by concatenating left-to-right and right-to-left representations.
  4. grouping together the hidden states (and initial embedding) in a certain way (concatenation followed by weighted summation)
  5. Transformers: a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. OpenAI GPT and BERT both depend on it.
  6. by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.
  7. Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
  8. The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers. The encoder's inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
  9. The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence
  10. An encoder receives a list of vectors as input. It processes this list by passing these vectors into a 'self-attention' layer, then into a feed-forward neural network, then sends the output upwards to the next encoder.
  11. Say the following sentence is an input sentence we want to translate: "The animal didn't cross the street because it was too tired." What does "it" in this sentence refer to? Is it referring to the street or to the animal? It's a simple question to a human, but not as simple to an algorithm. When the model is processing the word "it", self-attention allows it to associate "it" with "animal". As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word. Self-attention is the method the Transformer uses to bake the "understanding" of other relevant words into the one we're currently processing.
  12. 4: OpenAI GPT. 17: Semi-supervised sequence learning (Advances in Neural Information Processing Systems). 18: Universal language model fine-tuning for text classification.
  13. Previous work: computer vision.
  14. input/output, transformers
  15. Pretraining: the model is first trained on large amounts of unlabeled data. Fine-tuning: the model is initialised with the embeddings that were trained in the previous step; afterwards, all the parameters are fine-tuned using the labeled data from the chosen downstream task.
  16. Two sizes.
  17. Mention RoBERTa.
  18. The Next Sentence Prediction (NSP) task in the paper is related to [13] and [15]; the only difference is that [13] and [15] transfer only sentence embeddings to downstream tasks, whereas BERT transfers all the parameters to the various NLP downstream tasks.
  19. In the next slides BERT fine-tuning results on 11 NLP tasks will be discussed
  20. Fine-tuning: we go through the same steps as earlier; the final hidden vector C, which corresponds to the first input token [CLS], gets passed to a FFN that is used for classification.
  21. We can see from Table 1 that both BERTBASE and BERTLARGE outperform all previous state-of-the-art systems, with 4.5% and 7.0% respective average accuracy improvements. Another finding is that BERTLARGE outperforms BERTBASE across all tasks.
  22. The probability P_i that token i is the start of the answer span is computed as a softmax of S · T_i over all tokens in the sequence, i.e. P_i = exp(S · T_i) / Σ_j exp(S · T_j), where T_i is the final hidden vector of token i and S is the learned start vector. The same formula with an end vector E is used to predict the token at the end of the span.
  23. BERT Large achieved SOTA performance on SQuAD. The model that achieved the highest score was an ensemble of BERTLARGE models with the dataset augmented with TriviaQA [27] before fine-tuning on SQuAD. "The best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system."
  24. From Table 3 we can see that BERT achieved a +5.1 F1 improvement over the previous best systems.
  25. "BERTLARGE outperforms the authors' baseline ESIM+ELMo system by +27.1% and OpenAI GPT by 8.3%."
  26. The paper performed numerous useful ablation studies over different aspects of the BERT architecture in order to understand why they are important.
  27. But BERT was trained with a larger training dataset, and with BERT's input representation and fine-tuning scheme.
  28. In Table 5 we can see that removing NSP reduces the performance, especially on QNLI, MNLI, and SQuAD 1.1.
  29. #L = the number of layers; #H = hidden size; #A = number of attention heads. "LM (ppl)" is the masked LM perplexity of held-out training data. Using BERT large improved performance over BERT base on the selected GLUE tasks, even though BERT base already has a large number of parameters (110M) compared to the largest model tested in the original Transformer work (100M).
  30. Computationally efficient: using the already pre-trained contextual embeddings with relatively small models is considered less expensive than fine-tuning.
  31. BERTLARGE performs competitively with state-of-the-art methods. The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both fine-tuning and feature-based approaches.