BERT: Bidirectional Encoder Representations from Transformers
By: Shaurya Uppal
Defining Language
Language can be divided into three parts:
● Syntax
● Semantics
● Pragmatics
Syntax: word ordering and sentence form.
Semantics: the meaning of words.
Pragmatics: the social language skills we use in our daily interactions with others.
Example of Syntax, Semantics, Pragmatics
+ This discussion is about BERT. → syntax, semantics, pragmatics (SSP)
+ The green frogs sleep soundly. → syntax, semantics (SS)
+ BERT play football good → none
Why study BERT?
BERT achieves state-of-the-art performance on many Natural Language
Processing tasks. It can perform tasks such as text classification, text similarity,
next-sentence prediction, question answering, auto-summarization,
named entity recognition, etc.
What is BERT exactly?
BERT is a model pretrained by Google that learns bidirectional
representations from unlabeled text by jointly conditioning on both left
and right context in all layers.
Dataset used to Pre-train BERT
+ BooksCorpus (800M words)
+ English Wikipedia (2,500+M words)
A pretrained model can be applied in two ways: a feature-based approach or fine-tuning.
+ In fine-tuning, all weights change.
+ In the feature-based approach, only the final-layer weights change (the approach used by ELMo).
The pretrained model is then fine-tuned on different NLP tasks.
Pretraining and fine-tuning: you train a model m on data A, then continue training m
on data B from that checkpoint. [SLIDE 17]
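As a concrete illustration of this pretrain-then-fine-tune flow, here is a minimal sketch that loads a public BERT checkpoint (the "data A" stage is already baked into the released weights) and runs one fine-tuning step on a toy labeled batch standing in for "data B". It assumes the Hugging Face transformers library and PyTorch; the example texts, labels, and learning rate are placeholders, not the deck's actual setup.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pretrained checkpoint = "model m trained on data A" (BooksCorpus + Wikipedia).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# "Data B": a toy labeled batch (hypothetical examples).
texts = ["Please refund my order immediately.", "Thanks, everything arrived on time."]
labels = torch.tensor([1, 0])  # 1 = urgent, 0 = not urgent

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
loss = model(**batch, labels=labels).loss  # fine-tuning: every pretrained weight receives gradients
loss.backward()
optimizer.step()
print(float(loss))
```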
Language Training Approach
To train a language model, there are two approaches:
Context-free
+ Traditionally, words were embedded with context-free methods such as
word2vec or GloVe.
Contextual
+ RNN
+ BERT
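To make the context-free vs. contextual distinction concrete, here is a minimal sketch (assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint, neither of which is prescribed by the slides): the same surface word "bank" receives different vectors in different sentences, whereas a word2vec/GloVe lookup table would return a single fixed vector.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = enc.input_ids[0].tolist().index(bank_id)   # locate "bank" in the sequence
    return hidden[position]

v_money = bank_vector("I deposited cash at the bank.")
v_river = bank_vector("We sat on the bank of the river.")
print(torch.cosine_similarity(v_money, v_river, dim=0))   # noticeably below 1.0
```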
How does BERT Work?
BERT weights are learned in advance through two unsupervised tasks: masked
language modeling (predicting a missing word given the left and right context)
and next sentence prediction (predicting whether one sentence follows another).
BERT makes use of Transformer, an attention mechanism that learns contextual
relations between words (or sub-words) in a text.
(See Paper 2, "Attention Is All You Need".)
BERT uses multi-headed attention: there are multiple layers of attention, and every
layer contains multiple attention "heads" (12 in BERT-base, 16 in BERT-large). Since
model weights are not shared between layers, a single BERT-base model effectively
has 12 x 12 = 144 distinct attention mechanisms.
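As a sanity check on those numbers, the sketch below (assuming the Hugging Face transformers library, which is not part of the original deck) asks a BERT-base model to return its attention matrices and counts 12 layers with 12 heads each.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

enc = tokenizer("The animal didn't cross the street because it was too tired.",
                return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc).attentions      # tuple of per-layer attention tensors

print(len(attentions))                        # 12 layers
print(attentions[0].shape)                    # (batch, 12 heads, seq_len, seq_len)
print(len(attentions) * attentions[0].shape[1])  # 144 attention heads in total
```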
What does BERT learn, and how does it tokenize and handle
OOV words?
Consider the input example: "I went to the store. At the store, I bought fresh
strawberries."
BERT uses a WordPiece tokenizer, which breaks an OOV (out-of-vocabulary) word
into sub-word segments. For example, if play, ##ing, and ##ed are present in the
vocabulary but playing and played are OOV, they are broken down into
play + ##ing and play + ##ed respectively (## marks a sub-word piece).
BERT also requires a special [CLS] classifier token at the beginning of a sequence
and a [SEP] token at the end of each segment:
[CLS] I went to the store. [SEP] At the store I bought fresh straw ##berries. [SEP]
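A quick way to see this behavior is the sketch below, which assumes the Hugging Face transformers implementation of the WordPiece tokenizer (not something shown in the deck) and reproduces the example above.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits the OOV word into sub-word pieces, e.g. 'straw', '##berries'.
tokens = tokenizer.tokenize("I went to the store. At the store, I bought fresh strawberries.")
print(tokens)

# Encoding a sentence pair adds the [CLS] and [SEP] special tokens automatically.
encoded = tokenizer("I went to the store.", "At the store, I bought fresh strawberries.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'i', 'went', ..., '[SEP]', 'at', 'the', ..., 'straw', '##berries', '.', '[SEP]']
```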
Attention
An attention probe is a task defined on a pair of tokens (token_i, token_j) whose input is a model-wide attention vector,
formed by concatenating the entry a_ij from every attention matrix of every attention head in every layer.
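A minimal sketch of how that model-wide vector could be assembled, assuming attention tensors shaped like those returned by a transformers BERT model with output_attentions=True (dummy random tensors stand in for real attentions here):

```python
import torch

def attention_probe_vector(attentions, i, j):
    """Concatenate the single entry a_ij from every head in every layer.

    attentions: per-layer tensors of shape (batch, heads, seq_len, seq_len),
    e.g. the `attentions` tuple returned by a transformers BERT model.
    For BERT-base this yields a 12 * 12 = 144-dimensional vector.
    """
    return torch.cat([layer[0, :, i, j] for layer in attentions])

# Shape check with dummy attention matrices (12 layers x 12 heads, sequence length 8):
dummy = tuple(torch.rand(1, 12, 8, 8) for _ in range(12))
print(attention_probe_vector(dummy, i=2, j=5).shape)  # torch.Size([144])
```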
Some visual attention patterns, and why we use the
attention mechanism
Reason for attention: attention supports the two main
pre-training tasks of BERT, MLM (Masked Language Model)
and NSP (Next Sentence Prediction).
Visual patterns from the attention mechanism
● Attention to the next word. [Layer 2, Head 0] | analogous to a backward RNN
● Attention to the previous word. [Layer 0, Head 2] | analogous to a forward RNN
● Attention to identical/related words.
● Attention to identical words in the other sentence. | Helps in the next-sentence prediction task
● Attention to other words predictive of a word.
● Attention to the delimiter tokens [CLS] and [SEP].
● Attention to a bag of words.
How is the input fed into BERT?
MLM: Masked Language Model
Input: My dog is hairy.
Masking is done randomly: 15% of all WordPiece tokens in each input sequence
are masked, and we only predict the masked tokens rather than reconstructing
the entire input sequence.
Procedure:
+ 80% of the time: replace the word with [MASK]. "My dog is [MASK]."
+ 10% of the time: replace the word with a random word. "My dog is apple."
+ 10% of the time: keep the word unchanged. "My dog is hairy."
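A simplified, self-contained sketch of this 80/10/10 procedure is below. It operates on token strings for readability; the real implementation works on WordPiece token ids, and the tiny vocabulary here is purely illustrative.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply the MLM masking recipe: choose ~15% of tokens, then 80/10/10."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                      # only masked positions are predicted
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                   # 10%: keep the original token
        else:
            masked.append(tok)
            targets.append(None)                     # not predicted
    return masked, targets

print(mask_tokens(["my", "dog", "is", "hairy", "."], vocab=["apple", "car", "blue"]))
```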
Why is MLM best?
NSP: Next Sentence Prediction
Training method:
From unlabelled data, we take an input sequence A and, 50% of the time, use the
actually following sequence as B; the other 50% of the time we pick a random
sequence from the corpus as B.
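A small sketch of how such sentence pairs could be constructed from an unlabeled corpus (the corpus and label names here are illustrative, not the deck's data):

```python
import random

def make_nsp_pairs(sentences):
    """Build (A, B, label) pairs: 50% true next sentence, 50% random sentence."""
    pairs = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if random.random() < 0.5:
            b, label = sentences[i + 1], "IsNext"            # true next sentence
        else:
            candidates = sentences[:i + 1] + sentences[i + 2:]  # exclude the true next sentence
            b, label = random.choice(candidates), "NotNext"  # random sentence
        pairs.append((a, b, label))
    return pairs

corpus = ["I went to the store.", "At the store I bought fresh strawberries.",
          "The weather was nice.", "BERT uses two pre-training tasks."]
print(make_nsp_pairs(corpus))
```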
BERT Architecture
BERT is a multi-layer bidirectional Transformer encoder.
There are two models introduced in the paper.
● BERT base – 12 layers (transformer blocks), 12
attention heads, and 110 million parameters.
● BERT Large – 24 layers, 16 attention heads, and 340
million parameters.
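The two sizes can be reproduced from their configurations alone. The sketch below (assuming the Hugging Face transformers library) instantiates randomly initialised skeletons of both and counts parameters, landing near the 110M / 340M figures quoted above.

```python
from transformers import BertConfig, BertModel

base = BertModel(BertConfig())  # defaults: 12 layers, 12 heads, hidden size 768
large = BertModel(BertConfig(num_hidden_layers=24, num_attention_heads=16,
                             hidden_size=1024, intermediate_size=4096))

for name, model in [("BERT-base", base), ("BERT-large", large)]:
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```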
BERT Pretraining and Fine Tuning Architecture
Illustration of how the BERT pretraining
architecture stays the same and
only the fine-tuning layer
changes for different NLP tasks.
Related Work
ELMo: a pretrained, feature-based model (only the final-layer weights change) for
NLP tasks. Difference: ELMo uses LSTMs, while BERT uses the Transformer, an
attention-based model with positional encodings to represent word positions. ELMo
also fell short because it was word-based and could not handle OOV words.
OpenAI GPT: uses a left-to-right architecture in which every token can only attend to
previous tokens in the self-attention layers of the Transformer. It fell short because it
could not capture full bidirectional context.
How does BERT outperform the others?
The paper Visualizing and Measuring the Geometry of BERT shows how BERT
captures both the semantic and the syntactic features of a text.
It shows that the attention matrices contain grammatical representations and,
turning to semantics, uses visualizations of the activations created by different
pieces of text to give suggestive evidence that BERT distinguishes word senses
at a very fine level.
BERT’s internals consist of two parts. First, an initial embedding for each token is created by combining a
pre-trained word piece embedding with position and segment information. Next, this initial sequence of
embeddings is run through multiple transformer layers, producing a new sequence of context embeddings
at each step. Implicit in each transformer layer is a set of attention matrices, one for each attention head,
each of which contains a scalar value for each ordered pair (token_i, token_j). [SLIDE 11]
Experiment for Syntax Representation
The experiment uses a Penn Treebank corpus (3.1M dependency relations).
Grammatical dependencies were extracted with the PyStanfordDependencies library;
BERT-base was then run over each sentence to obtain its model-wide attention
matrix. [SLIDE 9]
With a 30% test split on this dataset, the probes reach 85.8% accuracy on the
binary probe and 71.9% on the multiclass probe (see the sketch below).
Result: the attention mechanism contains syntactic features.
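The probe itself is essentially a small classifier over those 144-dimensional model-wide attention vectors. A hedged sketch of the binary version follows: the random arrays stand in for the real features and dependency labels extracted from the Penn Treebank, and logistic regression is just one reasonable probe choice, not necessarily the paper's exact model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 144))   # placeholder: one 144-dim attention vector per token pair
y = rng.integers(0, 2, size=1000)  # placeholder: 1 = dependency edge exists, 0 = no edge

# 30% held-out test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("binary probe accuracy:", probe.score(X_test, y_test))
```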
Geometry of Word Sense (Experiment)
Using Wikipedia articles containing a query word, we applied a nearest-neighbour
classifier in which each neighbour is the centroid of a given word sense's BERT-base
embeddings in the training data.
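A minimal sketch of that nearest-centroid classifier: random vectors stand in for real BERT-base embeddings of the query word under each sense (768 is the BERT-base hidden size), and the sense labels are illustrative.

```python
import numpy as np

def nearest_sense(query_vec, sense_embeddings):
    """Assign the query to the sense whose centroid embedding is closest."""
    centroids = {sense: vecs.mean(axis=0) for sense, vecs in sense_embeddings.items()}
    return min(centroids, key=lambda s: np.linalg.norm(query_vec - centroids[s]))

rng = np.random.default_rng(0)
train = {"bank_river": rng.normal(0.0, 1.0, (50, 768)),
         "bank_finance": rng.normal(3.0, 1.0, (50, 768))}
query = rng.normal(3.0, 1.0, 768)     # stands in for a BERT embedding of "bank" in context
print(nearest_sense(query, train))    # -> "bank_finance"
```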
Conclusion
BERT is undoubtedly a breakthrough in the use of Machine Learning for Natural Language
Processing. The fact that it’s approachable and allows fast fine-tuning will likely allow a wide
range of practical applications in the future.
We also tested BERT on our SupportLen data for text classification.
SupportLen has a priority column in which we manually label
whether a customer email is urgent or not.
On this dataset we used BERT-base-uncased.
Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 28996
}
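For reference, a configuration like the one above can be rebuilt programmatically. The sketch below (assuming the Hugging Face transformers library) turns it into a randomly initialised classification model with two labels for the urgent / not-urgent task; loading our fine-tuned weights would replace the random initialisation.

```python
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    attention_probs_dropout_prob=0.1, hidden_act="gelu", hidden_dropout_prob=0.1,
    hidden_size=768, initializer_range=0.02, intermediate_size=3072,
    max_position_embeddings=512, num_attention_heads=12, num_hidden_layers=12,
    type_vocab_size=2, vocab_size=28996,
    num_labels=2,  # urgent vs. not urgent
)
model = BertForSequenceClassification(config)  # random weights; fine-tuned weights would be loaded here
```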
Some FAQs on BERT
1. WHAT IS THE MAXIMUM SEQUENCE LENGTH OF THE INPUT?
512 tokens
2. OPTIMAL VALUES OF THE HYPERPARAMETERS USED IN FINE-TUNING
● Dropout – 0.1
● Batch Size – 16, 32
● Learning Rate (Adam) – 5e-5, 3e-5, 2e-5
● Number of epochs – 3, 4
3. HOW MANY LAYERS ARE FROZEN IN THE FINE-TUNING STEP?
No layers are frozen during fine-tuning. All the pre-trained layers along with the task-specific parameters are trained
simultaneously.
4. HOW LONG DID IT TAKE GOOGLE TO PRETRAIN BERT?
Google pretrained BERT in about 4 days using 16 TPUs.
ULMFiT: Universal Language Model Fine-tuning for Text Classification
The ULMFiT paper added an intermediate step in which the model is fine-tuned on
text from the same domain as the target task before the final classification step; combined
with the BERT pretrained model, this yields better accuracy than using the pretrained BERT
model alone. We likewise fine-tune BERT on our custom data; it took around 50 minutes
per epoch on a Tesla K80 12GB GPU (P2 instance).
Future work and use cases that BERT can solve for us
+ Email Prioritization
+ Sentiment Analysis of Reviews
+ Review Tagging
+ Question-Answering for ChatBot & Community
+ The Similar Products problem, for which we currently use cosine similarity on
description text.
Testing of the ULMFiT-style experiment remains to be done, by fine-tuning BERT on
our domain dataset.
Editor's notes
  1. Language discussion
  2. Examples
  3. BERT's power; what is BERT
  4. Data used and pretrain vs finetune
  5. Talk about: "bank of the river" vs. "bank account"
  6. As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
  7. Word Piece Tokenizer: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf
  8. Attention Visual:- https://colab.research.google.com/drive/1Nlhh2vwlQdKleNMqpmLDBsAwrv_7NnrB
  9. Understanding the Attention Patterns: https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77
  10. A positional embedding is also added to each token to indicate its position in the sequence.
  11. The advantage of this method is that the Transformer does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every token.
  12. NSP helps in Q&A and in understanding the relation between sentences.
  13. State of the Art: the most recent stage in the development of a product, incorporating the newest ideas and features. Parse Tree Embedding Concept- mathematical proof
  14. Miscellaneous:- Matthew Correlation Coefficient: https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
  15. Miscellaneous: What is a TPU? https://www.google.com/search?q=tpu+full+form&rlz=1C5CHFA_enIN835IN835&oq=TPU+full+form&aqs=chrome.0.0l6.3501j0j9&sourceid=chrome&ie=UTF-8 Which Activation is used in BERT? https://datascience.stackexchange.com/questions/49522/what-is-gelu-activation Gaussian Error Linear Unit
  16. Demo of Google Collab: Sentiment on Movie Reviews: https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb?authuser=1#scrollTo=VG3lQz_j2BtD Sentence Pairing and Sentence Classification: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb?authuser=1#scrollTo=0yamCRHcV-nQ BERT FineTune on Data. https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning