Convolutional Neural Nets for
Language Modeling
Anuj Gupta
Lead Data Scientist, FreshWorks
@anujgupta82
anujgupta82@gmail.com
• Background
• CNN
• Language modeling
• Intuition behind this fusion
• Deep dive
• Key take home
Agenda
3
Building Blocks
4
• Introduced by Yann LeCun in 1998*
• Have been super successful in the area of vision; they have almost become the bread and butter of computer
vision problems.
• CNNs treat an image as a signal in the spatial domain.
• Have many nice properties that make them super useful:
• Spatial invariance – translation, rotation
• Local structure
• Fast (concurrent calculations in each layer)
Convolutional Neural Nets (CNN)
* LeNet-5 in "Gradient-based learning applied to document recognition" 5
Basics of CNN
• Input : Image
• Image is nothing but a signal in space.
• Represented by a matrix of values (RGB)
• Each value ~ intensity of the Red, Green and Blue channels respectively (see the array sketch below).
• 2 Key operations are : Convolution & Pooling
6
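A minimal sketch (assuming NumPy; the toy image below is made up) of the point above – an RGB image is just an array of intensity values:

```python
import numpy as np

# A toy 4x4 RGB "image": height x width x 3 channels,
# each entry an intensity in [0, 255] (an intensity, not a wavelength).
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

red_channel = image[:, :, 0]           # 4x4 matrix of red intensities
print(image.shape, red_channel.shape)  # (4, 4, 3) (4, 4)
```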
• In the simplest terms: given 2 signals x() and h(), convolution combines the
2 signals:
• In the discrete space (the formula is written out below):
• For our case the image is x()
• h() is called the filter/kernel/feature detector – a well-known concept in the world
of image processing.
Convolution
7
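For reference, a standard way to write the discrete convolution the slide refers to (the 2-D form is the one applied to images; deep-learning libraries typically implement the closely related cross-correlation):

```latex
\text{1-D: } (x * h)[n] = \sum_{k} x[k]\, h[n-k]
\qquad
\text{2-D: } (x * h)[i,j] = \sum_{m}\sum_{n} x[i-m,\, j-n]\, h[m,n]
```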
• Ex: filters for edge detection,
blurring, sharpening, etc.
• It is usually a small matrix –
3x3, 5x5, 5x7, etc.
• There are well known
predefined filters
https://en.wikipedia.org/wiki/Kernel_(image_processing)
8
Filter (3×3):
1 0 1
0 1 0
1 0 1

Image (5×5):
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0

Convolving the filter with the top-left 3×3 patch of the image:
1*1 + 1*0 + 1*1 + 0*0 + 1*1 + 1*0 + 0*1 + 0*0 + 1*1 = 4
• The convolved feature is nothing but taking a part of the image and applying the
filter over it – taking pairwise products and adding them up (a NumPy sketch follows below).
9
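A small sketch (using NumPy; sliding the filter without padding, as in the diagram) that reproduces the value 4 computed above and the full convolved feature map:

```python
import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

filt = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]])

fh, fw = filt.shape
out_h = image.shape[0] - fh + 1
out_w = image.shape[1] - fw + 1
feature_map = np.zeros((out_h, out_w), dtype=int)

for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + fh, j:j + fw]
        feature_map[i, j] = np.sum(patch * filt)  # pairwise products, then sum

print(feature_map[0, 0])  # 4, the value computed on the slide
print(feature_map)        # the full 3x3 convolved feature map
```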
• The convolved feature map is nothing but the result of sliding the filter over the entire image
and applying the convolution at each step, as shown in the diagram below:
1 0 1
0 1 0
1 0 1
https://stats.stackexchange.com/questions/154798/difference-between-kernel-and-filter-in-cnn
Filter
10
• Image processing over the past many decades has
built many filters for specific tasks.
• In DL (CNNs), rather than using predefined filters,
we learn the filters.
• We start with small random values and update
them using gradients (a minimal sketch follows below).
? ? ?
? ? ?
? ? ?
11
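A hedged illustration (assuming PyTorch; the input and target here are dummy tensors) of the point above – the filter starts as small random values and is updated by gradient descent rather than being hand-designed:

```python
import torch
import torch.nn as nn

# A single learnable 3x3 filter (the "? ? ?" matrix above)
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
print(conv.weight.data)           # small random initial values

x = torch.randn(1, 1, 5, 5)       # dummy 5x5 single-channel image
target = torch.randn(1, 1, 3, 3)  # dummy target feature map

optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)
loss = nn.functional.mse_loss(conv(x), target)
loss.backward()
optimizer.step()                  # filter values move along the gradient
print(conv.weight.data)           # updated filter
```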
• It’s a simple technique for down/sub-sampling.
• In CNNs, down-sampling or "pooling" layers are often placed after
convolutional layers.
• They are used mainly to reduce the feature-map dimensionality for
computational efficiency. This in turn can improve actual performance.
• Takes disjoint chunks of the image (typically 2×2) and aggregates
them into a single value.
• Average, max, min, etc. The most popular is max-pooling (sketched below).
Pooling
https://cambridgespark.com/content/tutorials/convolutional-neural-networks-with-keras/index.html
12
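A minimal NumPy sketch of 2×2 max-pooling over disjoint chunks, as described above:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 2],
                        [0, 2, 5, 7],
                        [1, 1, 3, 4]])

h, w = feature_map.shape
# Split into disjoint 2x2 blocks and keep the max of each block
pooled = feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 2]
#  [2 7]]
```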
Putting it all together
https://adeshpande3.github.io
13
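A sketch (assuming PyTorch; layer sizes are illustrative only) of a typical small image CNN in the spirit of the diagram – alternating convolution + pooling layers followed by a fully connected classifier:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn 16 filters over an RGB input
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 2x2 max-pooling
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # assumes 32x32 inputs and 10 classes
)
```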
Language Modeling
• Filter out good sentences from bad ones.
• Good = semantically and syntactically correct.
• Model it via probability distribution over sequences of words Pr (w1, w2, ….., wn)
• Assign a probability to a sentence such that
S1 = “the cat jumped over the dog”, Pr(S1) ~ 1
S2 = “jumped over the the cat dog”, Pr(S2) ~ 0
14
Language Modeling
• Machine Translation:
• P(high winds tonite) > P(large winds tonite)
• Spell Correction :
• The office is about fifteen minuets from my house.
• P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition :
• P(I saw a van) >> P(eyes awe of an)
• Summarization, question – answering, etc., etc.
15
• Unigram Language Models: assume each word occurs completely independently of the others
• Overly simplistic!
• Bigram Language Models: a word in a sentence is influenced only by its immediate
predecessor
• This too is naïve but goes a long way in understanding some key concepts.
16
• N-gram models: try to capture longer-term dependencies via the chain rule
Pr (w1, w2, ….., wn) = ∏i=1..n Pr (wi | w1, w2, ….., wi-1)
• This captures how likely a sentence is in a given language (a toy bigram example follows below).
17
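A toy count-based bigram model (a minimal sketch on a made-up two-sentence corpus) showing how the factorization above is used to score a sentence:

```python
from collections import Counter

corpus = [["the", "cat", "jumped", "over", "the", "dog"],
          ["the", "dog", "sat"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def sentence_prob(sent):
    """Pr(w1 .. wn) approximated by a product of bigram probabilities Pr(wi | wi-1)."""
    prob, prev = 1.0, "<s>"
    for w in sent:
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, w)] / unigrams[prev]
        prev = w
    return prob

print(sentence_prob(["the", "cat", "jumped", "over", "the", "dog"]))  # > 0
print(sentence_prob(["jumped", "over", "the", "the", "cat", "dog"]))  # 0.0
```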
Deep Learning + Language Modeling
• Traditionally uses architectures such as Recurrent Neural Networks (RNNs).
• Sequential processing: one unit after the other.
• Over time, advancements happened and concepts like 2-way ordering
(Bidirectional), memory (LSTM), attention, etc. got added.
• Some people explored the possibility of using CNNs for language modeling:
• Pixels are spread in space. So they are nothing but a signal in space.
• Words/tokens/characters are spread in time. So they are nothing but a signal in time.
18
CNNs for Language Modeling
19
• The input for any NLP task is sentences/paragraphs/docs in the form of a matrix
• Each row of this matrix represents a unit/token of text – character, morpheme,
word, etc. (typically a row = the 1-hot or embedding representation of that unit)
• Unlike images, where the filter slides over local patches of an image, in NLP we
typically use filters that slide over full rows of the matrix, i.e. the “width” of our
filters is usually the same as the width of the input matrix. [1D or temporal
convolutions]
• The height, or region size, varies. Typically, the window slides over 2-5 words at a
time (a code sketch follows below).
20
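A hedged sketch (assuming PyTorch; vocabulary size, embedding size and sentence length below are illustrative) of the 1-D/temporal convolution just described – each filter spans the full embedding width and slides over 3 tokens at a time:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, seq_len = 1000, 128, 7
embed = nn.Embedding(vocab_size, embed_dim)
# 100 filters, each covering 3 consecutive tokens x the full embedding width
conv = nn.Conv1d(in_channels=embed_dim, out_channels=100, kernel_size=3)

tokens = torch.randint(0, vocab_size, (1, seq_len))  # one sentence of 7 token ids
x = embed(tokens).transpose(1, 2)                    # (batch, embed_dim, seq_len)
features = torch.relu(conv(x))                       # (1, 100, seq_len - 3 + 1)
pooled = features.max(dim=2).values                  # max-over-time -> (1, 100)
print(pooled.shape)
```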
21
• Much of the success of CNNs is attributed to:
• Location Invariance: where an object appears in an image doesn’t matter so much
• Local Compositionality: a bunch of local objects combine/compose to give more complex
objects.
22
• In CNN+NLP, both aforementioned properties go for a toss
• Where a word appears in a sentence can change the meaning drastically.
Man bites dog.
Dog bites man.
• Parts of phrases could be separated by several other words. Words do compose in some ways,
but how exactly this works, and what higher-level representations actually “mean” – these aren’t as
obvious as in the Computer Vision case.
“Tim said Robert has a lot of experience; he feels you should definitely meet him”
• With both key advantages gone, why are we even thinking of applying CNNs to text?
RNNs should be the way to go.
23
• “All models are wrong, but some are useful”
• This is not about CNNs vs RNNs (may be both are bad!)
• This is about
• Understanding key difficulties
• Are there some aspects of language modeling where CNNs can do a better job?
• Helps us better understand the strengths & weaknesses of each model.
• Turns out that CNNs applied to certain NLP problems perform quite well, esp.
classification tasks – Sentiment Analysis, Spam Detection or Topic
Categorization.
• CNNs are usually fast, very fast.
24
Major works in this sub-area
• Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. EMNLP 2014
• Santos, C. N. dos, & Gatti, M. (2014). Deep Convolutional Neural Networks for Sentiment
Analysis of Short Texts. COLING-2014
• Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). A Latent Semantic Model with
Convolutional-Pooling Structure for Information Retrieval. CIKM ’14.
• Santos, C., & Zadrozny, B. (2014). Learning Character-level Representations for Part-of-Speech
Tagging. ICML-14.
• Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text
Classification, 1–9.
• Wenpeng Yin, Hinrich Schutze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: attention-based
convolutional neural network for modeling sentence pairs.
25
• Ngoc Thang Vu, Heike Adel, Pankaj Gupta, and Hinrich Schutze. 2016. Combining recurrent
and convolutional neural networks for relation classification. In Proceedings of NAACL HLT.
pages 534–539.
• Ying Wen, Weinan Zhang, Rui Luo, and Jun Wang.2016. Learning text representation using
recurrent convolutional neural network with highway layers. SIGIR Workshop on Neural
Information Retrieval
• Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with
gated convolutional networks. arXiv preprint arXiv:1612.08083
• Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schutze Comparative Study of CNN and
RNN for Natural Language Processing
• Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2015). Character-Aware Neural Language
Models. (Uses a hybrid of CNN and RNN)
26
Deep Dive
27
Character-Aware Neural Language Models *
• Problem statement: Given t words w1, w2, ….., wt ; predict wt+1
• Traditional models: words are fed as inputs in the form of word embeddings.
• Here the input word embedding is replaced by the output of a character-level CNN.
• Uses sub-word information.
• Traditionally, sub-word information is fed in terms of morphemes;
Unbreakable: Un ("not") – break (root word) – able (“can be done”)
* “Character-Aware Neural Language Models”, Y. Kim et al. 2015 28
• Identifying morphemes is non-trivial. It requires morphological tagging as
preprocessing.
• Y. Kim et al. leverage sub-word information through a character-level CNN.
• Learn an embedding for each character.
• A word w is then nothing but the embeddings of its constituent characters.
• For each word, we apply convolution on its character embeddings to obtain features.
• These are then fed to an LSTM via highway layers.
• Does not use word embeddings at all.
• In most language models, a large % of the parameters come from word
embeddings. Thus, we get a much smaller number of parameters to learn.
29
Details
C – vocabulary of characters.
D – dimensionality of character embeddings.
R ∈ ℝ D x |C| – matrix of character embeddings.
Let word wk = [c1, ...., cl], i.e. made from l characters, where l
is the length of wk.
The character-level representation of wk is given by the matrix
Ck ∈ ℝ D x l, where the jth column corresponds to the character
embedding of the jth character of word wk.
Apply a filter/kernel H ∈ ℝ D x w (of width w) to Ck to obtain the feature map fk ∈ ℝ l-w+1.
The ith element of fk is given by:
fk[i] = tanh( ⟨ Ck[∗, i : i+w−1], H ⟩ + b )
where Ck[∗, i : i+w−1] is the ith to (i+w−1)th columns of Ck, and
⟨A, B⟩ = Tr(ABᵀ) is the Frobenius inner product.
[Diagram: the embedding matrix R is D × |C|; Ck is D × l with columns c1 … cl; the feature map fk has length l − w + 1.]
30
• To capture the most important feature, we take the max over time: yk = maxi fk[i].
yk is the feature corresponding to filter H when applied to word wk
(~ finds the most important character n-gram).
• Likewise, they apply h such filters: H1, …., Hh.
• Then, yk = [y1k, …., yhk] (one max-pooled feature per filter) is the input representation of word wk
(see the sketch below).
At this point of time we can either:
• Construct MLP over yk
• Feed yk to LSTM
31
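A simplified sketch (assuming PyTorch; the character-vocabulary size, embedding size and filter counts are illustrative, not the paper's hyper-parameters) of the character-level convolution plus max-over-time described in the last two slides – one bank of filters per width, and the concatenation of their max-pooled outputs is yk:

```python
import torch
import torch.nn as nn

char_vocab, D = 50, 15                   # |C| characters, embedding dimensionality D
char_embed = nn.Embedding(char_vocab, D)

filter_widths = [2, 3, 4]                # character n-gram sizes w
num_filters = 25                         # filters per width
convs = nn.ModuleList([nn.Conv1d(D, num_filters, kernel_size=w) for w in filter_widths])

def word_representation(char_ids):
    """char_ids: LongTensor of shape (1, l) -- the characters of a single word wk."""
    Ck = char_embed(char_ids).transpose(1, 2)  # (1, D, l)
    features = []
    for conv in convs:
        fk = torch.tanh(conv(Ck))              # (1, num_filters, l - w + 1)
        yk_w = fk.max(dim=2).values            # max over time
        features.append(yk_w)
    return torch.cat(features, dim=1)          # yk, later fed to highway layers / LSTM

word = torch.randint(0, char_vocab, (1, 6))    # a 6-character word
print(word_representation(word).shape)         # (1, 75)
```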
Instead, to gain improvements, rather than feeding yk directly to the LSTM, they pass it through a Highway
network*
Highway network:
Basic idea: carry some part of the input directly to the output,
while the remaining input is processed and then taken forward.
Very similar to residual networks.
F() is typically an affine transformation followed by tanh.
In Highway networks, we learn “what parts of the input should be
carried forward via the highway”.
This is done via a gating mechanism: a transform gate t and a carry gate (1 − t), i.e.
z = t ⊙ F(y) + (1 − t) ⊙ y   (a minimal sketch follows below).
32
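A minimal sketch (assuming PyTorch) of one highway layer as described above – a transform gate t decides how much of the processed input to keep versus how much to carry through unchanged:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # F(): affine transformation (+ tanh below)
        self.gate = nn.Linear(dim, dim)       # produces the transform gate t

    def forward(self, y):
        t = torch.sigmoid(self.gate(y))       # transform gate
        h = torch.tanh(self.transform(y))     # processed part of the input
        return t * h + (1.0 - t) * y          # carry gate is (1 - t)

layer = Highway(75)
z = layer(torch.randn(1, 75))                 # same shape in and out: (1, 75)
```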
33
In a nutshell
Results
34
Key take home
• CNNs + NLP surely holds a lot of promise.
• Pretty successful in classification settings.
• Can prove to be a great tool to model the input aspects of NLP.
• What about non-classification settings?
• Sequence labeling (NER)
• Sequence generation (MT)
• As of today, not so successful.
• Though people have tried a lot of ideas there too:
• de-convolutions in generative settings
• Some architectures use different embeddings as different channels. 35
More Resources
• https://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/
• https://medium.com/@TalPerry/convolutional-methods-for-text-d5260fd5675f
• wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
• https://blogs.technet.microsoft.com/machinelearning/2017/02/13/cloud-scale-text-classification-with-
convolutional-neural-networks-on-microsoft-azure/
• https://www.aclweb.org/anthology/P/P14/P14-1062.xhtml
• https://github.com/yoonkim/lstm-char-cnn
• https://github.com/yoonkim/CNN_sentence
• https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32
• “Comparative Study of CNN and RNN for Natural Language Processing” Wenpeng Yin et. al 2017,
arXiv:1702.01923 [cs.CL]
36
Thanks
Questions ?
37
@anujgupta82
anujgupta82@gmail.com