2018.11.15.
AI Labs
NL.K team
김성현
BERT: Pre-training of Deep Bidirectional Transformers
for Language Understanding
Overview - BERT?
• Bidirectional Encoder Representations from Transformers (BERT) [1] is a language model that learns deep pre-trained word representations using a bidirectional Transformer [2]
• Pre-trained BERT language model + transfer learning (fine-tuning) → NLP applications!
[Figure: question answering with pre-trained BERT plus one classification layer. Given a passage ("… James Marshall "Jimi" Hendrix was an American rock guitarist, singer, and songwriter. …") and the question "Who is Jimi Hendrix?", the model answers: "Jimi" Hendrix was an American rock guitarist, singer, and songwriter.]
SQuAD v1.1 dataset leaderboard [3]
[1] Devlin et al., 2018, arXiv [2] Vaswani et al., 2017, arXiv [3] Rajpurkar et al., 2016, arXiv
Introduction – Language Model
• How do we encode and decode natural language? → A language model
[Figure: a language model encodes natural language and decodes it for applications such as machine translation, named entity recognition, TTS, MRC, STT, and POS tagging. Example: a passage about James Marshall "Jimi" Hendrix plus the question "Who is Jimi Hendrix?" decodes to the answer: "Jimi" Hendrix was an American rock guitarist, singer, and songwriter.]
Introduction – Word Embedding
• Word embedding is a language-modeling technique in which words or phrases from a large unlabeled corpus are mapped to vectors of real numbers
• The skip-gram word embedding model vectorizes a word by using the target word to predict its surrounding words

Example sentence: "Duct tape may work anywhere"
[Figure: skip-gram network, where the target word predicts its context words]

Word         One-hot vector
"duct"       [1, 0, 0, 0, 0]
"tape"       [0, 1, 0, 0, 0]
"may"        [0, 0, 1, 0, 0]
"work"       [0, 0, 0, 1, 0]
"anywhere"   [0, 0, 0, 0, 1]
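To make the skip-gram objective concrete, here is a minimal training sketch (not part of the original slides) in plain numpy; the window size, embedding dimension, and learning rate are illustrative assumptions.

```python
import numpy as np

# Toy corpus and one-hot indices from the table above
words = ["duct", "tape", "may", "work", "anywhere"]
vocab = {w: i for i, w in enumerate(words)}
V, D = len(words), 3            # vocabulary size, embedding dimension

# Skip-gram pairs: each target word predicts neighbors within a window of 2
pairs = [(t, c) for t in range(V) for c in range(V)
         if t != c and abs(t - c) <= 2]

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # target-word embeddings
W_out = rng.normal(scale=0.1, size=(D, V))   # output projection

lr = 0.1
for _ in range(500):
    for t, c in pairs:
        h = W_in[t]                          # hidden layer = target embedding
        scores = h @ W_out                   # logits over the vocabulary
        p = np.exp(scores - scores.max())
        p /= p.sum()                         # softmax
        grad = p.copy()
        grad[c] -= 1.0                       # cross-entropy gradient w.r.t. scores
        grad_h = W_out @ grad                # backprop into the target embedding
        W_out -= lr * np.outer(h, grad)
        W_in[t] -= lr * grad_h

print("embedding for 'tape':", W_in[vocab["tape"]])
```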
Introduction – Word Embedding
• Visualization of word embedding: https://ronxin.github.io/wevi/
• However, word embedding algorithms cannot represent the 'context' of natural language: each word receives a single vector regardless of the sentence it appears in
Introduction – Markov Model
• A Markov model can represent the context of natural language
• A bigram language model can calculate the probability of a sentence as a chain of word-to-word transition probabilities (a Markov chain)

Transition probabilities estimated from the corpus {"I like rabbits", "I like turtles", "I don't like snails"}:

           I     don't  like  rabbits  turtles  snails
I          0     0.33   0.66  0        0        0
don't      0     0      1.0   0        0        0
like       0     0      0     0.33     0.33     0.33
rabbits    0     0      0     0        0        0
turtles    0     0      0     0        0        0
snails     0     0      0     0        0        0

P("I like rabbits") = 0.66 * 0.33 = 0.22
P("I like turtles") = 0.66 * 0.33 = 0.22
P("I don't like snails") = 0.33 * 1.0 * 0.33 = 0.11
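A small sketch (added here, not in the original deck) that reproduces the table's probabilities from the three training sentences:

```python
from collections import Counter, defaultdict

corpus = ["I like rabbits", "I like turtles", "I don't like snails"]

# Count word-to-word transitions (the bigram table above)
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, curr in zip(words, words[1:]):
        bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

def sentence_prob(sentence):
    """Markov-chain probability: product of the transition probabilities."""
    words = sentence.split()
    p = 1.0
    for prev, curr in zip(words, words[1:]):
        p *= bigram_prob(prev, curr)
    return p

print(round(sentence_prob("I like rabbits"), 2))       # 0.22
print(round(sentence_prob("I don't like snails"), 2))  # 0.11
```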
Introduction – Recurrent Neural Network
• A recurrent neural network (RNN) contains nodes with a directed cycle: the previous step's hidden layer feeds into the current step's hidden layer
[Figures: basic RNN model architecture (current step input + previous step hidden layer → current step hidden layer → predicted next step output) and an example that predicts the next character from one-hot vectors]
• However, an RNN must compute the target outputs in consecutive order, one step at a time
• Long-distance dependency problem
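A single RNN step in numpy (an illustrative sketch; the weight shapes and sizes are assumptions, not from the slides). The explicit loop over time steps is exactly the sequential bottleneck noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 8                              # one-hot input size, hidden size
W_xh = rng.normal(scale=0.1, size=(H, V))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hy = rng.normal(scale=0.1, size=(V, H))
b_h, b_y = np.zeros(H), np.zeros(V)

def rnn_step(x_onehot, h_prev):
    """Combine the current input with the previous hidden state."""
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev + b_h)
    logits = W_hy @ h + b_y              # scores for the next character
    p = np.exp(logits - logits.max())
    return h, p / p.sum()                # new hidden state, next-char distribution

h = np.zeros(H)
for t in [0, 1, 2]:                      # must run in consecutive order
    x = np.eye(V)[t]                     # one-hot vector for character t
    h, p_next = rnn_step(x, h)
print(p_next)
```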
Introduction – Attention Mechanism
• Attention is motivated by how we pay attention to different regions of an image or correlate words in one sentence
[Figure: attention weights are a softmax over Query-Key similarity scores against the hidden states; the output is the weighted sum of the Values]
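A minimal sketch of scaled dot-product attention over Query/Key/Value (added for illustration; the scaling by sqrt(d) follows the Transformer paper):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: weights come from query-key similarity."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over the keys
    return w @ V                                # weighted sum of the values

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 16))               # four hidden states
out = attention(hidden[2:3], hidden, hidden)    # self-attention for position 2
print(out.shape)                                # (1, 16)
```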
Introduction – Attention Mechanism
• Attention for neural machine translation (NMT): https://distill.pub/2016/augmented-rnns/#attentional-interfaces
• Attention for speech-to-text (STT)
Introduction – Transformer [1]
• Transformer architecture
[Figure: the Transformer architecture, stepped through over four slides]
• The decoder is trained to predict the next word, minimizing the difference between the label and the output
[1] Vaswani et al., 2017, arXiv
Research Aims
• To improve Transformer-based language models by proposing BERT
• To show that a fine-tuning-based language model built on pre-trained representations achieves state-of-the-art performance on many natural language processing tasks
Methods
• Model architecture
[Figure: input embedding layer → stacked Transformer layers → contextual representation of each token]

                       BERT-Base   BERT-Large
Transformer layers     12          24
Hidden state size      768         1024
Self-attention heads   12          16
Total parameters       110M        340M
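As a rough stand-in (not the authors' implementation), the BERT-Base stack can be approximated with PyTorch's built-in encoder modules; the feed-forward size 3072 (= 4 * 768) comes from the paper, and batch_first assumes PyTorch ≥ 1.9:

```python
import torch
import torch.nn as nn

# BERT-Base sizes from the table: 12 layers, hidden 768, 12 attention heads
layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072,
    activation="gelu", batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)

tokens = torch.randn(2, 128, 768)   # (batch, sequence length, hidden)
contextual = encoder(tokens)        # contextual representation of each token
print(contextual.shape)             # torch.Size([2, 128, 768])
```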
Methods
• Corpus data for pre-training
• BooksCorpus (800M words)
• English Wikipedia (2,500M words, without lists, tables, and headers)
• 30,000-token vocabulary
• Data preprocessing for pre-training
• Words are split into subword units by WordPiece tokenization [1-2]
He likes playing → He likes play ##ing
• For pre-training, a 'token sequence' is constructed by packing a pair of two sentences together
• The sequence begins with a classification label token; the second sentence is the actual next sentence or a randomly chosen sentence (50% each), as sketched below
[Figure: example of a two-sentence token sequence]
[1] Sennrich et al., 2016, ACL [2] Kudo, 2018, ACL
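A hedged sketch of how such a pair might be assembled; the token names follow the paper's [CLS]/[SEP] convention, but the helper function itself is illustrative, not the authors' code:

```python
import random

def make_pretraining_pair(sent_a, next_sent, corpus_sentences):
    """Pack two tokenized sentences; 50% actual next sentence, 50% random."""
    if random.random() < 0.5:
        sent_b, is_next = next_sent, "IsNext"
    else:
        sent_b, is_next = random.choice(corpus_sentences), "NotNext"
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, is_next

tokens, segments, label = make_pretraining_pair(
    ["he", "likes", "play", "##ing"],
    ["it", "is", "fun"],
    [["some", "random", "sentence"], ["another", "one"]])
print(tokens, label)
```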
Methods
• Data preprocessing for pre-training
• The masked language model (MLM) objective masks some percentage of the input tokens at random
• 15% of the tokens are chosen at random; of these, 80% are replaced with [MASK], 10% are replaced with a random token (e.g. 'hairy'), and 10% are left unchanged
[Figure: MLM corruption applied to an example two-sentence token sequence]
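A sketch of the 80/10/10 corruption rule (illustrative; the real implementation works on WordPiece ids rather than strings):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Choose 15% of tokens; 80% -> [MASK], 10% -> random token, 10% unchanged."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_rate:
            continue
        labels[i] = tok                          # model must predict the original
        r = random.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"              # 80%: masking
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)  # 10%: random replacement
        # else: 10% left unchanged
    return corrupted, labels

print(mask_tokens(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"],
                  vocab=["apple", "hairy", "run"]))
```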
Methods
• The input embedding is the sum of the token embeddings, the segment embeddings, and the position embeddings
[Figure: input representation for a sequence of up to 512 positions; the [CLS] output is used to predict IsNext or NotNext]
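A minimal PyTorch sketch of the three-way sum; the token ids are made up for illustration, while the sizes (30,000 vocabulary, 512 positions, hidden 768) are the ones quoted in this talk:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768
tok_emb = nn.Embedding(vocab_size, hidden)     # token embeddings
seg_emb = nn.Embedding(2, hidden)              # segment A / segment B
pos_emb = nn.Embedding(max_len, hidden)        # learned positions 0..511

token_ids = torch.tensor([[101, 7592, 102, 2088, 102]])   # illustrative ids
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

# Input representation = element-wise sum of the three embeddings
x = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
print(x.shape)   # torch.Size([1, 5, 768])
```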
Methods
• Training options
• Training batch size: 256 sequences (256 sequences * 512 tokens = 131,072 tokens/batch)
• Steps: 1M
• Epochs: 40
• Adam learning rate: 1e-4
• Weight decay: 0.01
• Dropout probability: 0.1
• Activation function: GELU
• Environment setup
• BERT-Base: 4 Cloud TPUs (16 TPU chips total)
• BERT-Large: 16 Cloud TPUs (64 TPU chips total) ≈ 72 P100 GPUs
• Training time: 4 days
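As a sketch, these options map onto a standard PyTorch setup; AdamW here stands in for the paper's Adam-with-weight-decay, and the linear layer is only a placeholder for the encoder above:

```python
import torch

model = torch.nn.Linear(768, 768)   # placeholder for the BERT encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
dropout = torch.nn.Dropout(p=0.1)   # dropout probability 0.1
activation = torch.nn.GELU()        # GELU activation
```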
Methods
• Experiments (11 NLP tasks in total)
• GLUE datasets
‒ MNLI: Multi-Genre Natural Language Inference
‒ Predict whether the second sentence is an entailment, a contradiction, or neutral with respect to the first
‒ QQP: Quora Question Pairs
‒ Predict whether two questions are semantically equivalent
‒ QNLI: Question Natural Language Inference
‒ A question-answering dataset converted to sentence-pair classification
‒ SST-2: The Stanford Sentiment Treebank
‒ Single-sentence classification of movie reviews with human annotations of their sentiment
‒ CoLA: The Corpus of Linguistic Acceptability
‒ Single-sentence classification to predict whether an English sentence is linguistically acceptable or not
‒ STS-B: The Semantic Textual Similarity Benchmark
‒ News-headline sentence pairs annotated with a score from 1 to 5 denoting how similar the two sentences are in semantic meaning
‒ MRPC: Microsoft Research Paraphrase Corpus
‒ Sentence pairs from online news sources with human annotations for whether the sentences in the pair are semantically equivalent
‒ RTE: Recognizing Textual Entailment
‒ Similar to MNLI, but with much less training data
‒ WNLI: Winograd NLI
‒ A small natural language inference dataset
• SQuAD v1.1
• CoNLL 2003 Named Entity Recognition dataset
• SWAG: Situations With Adversarial Generations
‒ Choose the most plausible continuation sentence among four choices
Methods
• Experiments (11 NLP tasks in total)
[Figures: fine-tuning setups for sentence-pair classification, single-sentence classification, question answering (SQuAD v1.1), and single-sentence tagging]
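For the classification-style tasks, fine-tuning adds a single layer on top of the final [CLS] representation. A minimal sketch follows; the small random encoder stands in for pre-trained BERT and is not the released model:

```python
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    """Encoder body + one classification layer over the [CLS] position."""
    def __init__(self, encoder, hidden=768, num_labels=3):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, embedded_tokens):             # (batch, seq, hidden)
        h = self.encoder(embedded_tokens)
        return self.classifier(h[:, 0])             # [CLS] is the first token

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
stand_in = nn.TransformerEncoder(layer, num_layers=2)  # untrained stand-in
model = PairClassifier(stand_in)                       # e.g. MNLI: 3 labels
logits = model(torch.randn(2, 16, 768))
print(logits.shape)                                    # torch.Size([2, 3])
```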
Results
• GLUE test results
• SQuAD v1.1
Results
• Named Entity Recognition (CoNLL-2003)
• SWAG
Conclusion
• BERT is undoubtedly a breakthrough in the use of machine learning for natural
language processing
• The bidirectional Transformer architecture enhances natural language processing performance
Discussion
• English SQuAD v1.1 test
• Korean BERT training
[Figures: BERT English vocabulary vs. BERT Korean vocabulary, and the Korean BERT model]
Thank you