SlideShare ist ein Scribd-Unternehmen logo
1 von 61
Downloaden Sie, um offline zu lesen
Natural LanguageProcessing
Yuriy Guts – Jul 09, 2016
Who Is This Guy?
Data Science Team Lead
Sr. Data Scientist
Software Architect, R&D Engineer
I also teach Machine Learning:
What is NLP?
Study of interaction between computers and human languages
NLP = Computer Science + AI + Computational Linguistics
Common NLP Tasks
Easy Medium Hard
• Chunking
• Part-of-Speech Tagging
• Named Entity Recognition
• Spam Detection
• Thesaurus
• Syntactic Parsing
• Word Sense Disambiguation
• Sentiment Analysis
• Topic Modeling
• Information Retrieval
• Machine Translation
• Text Generation
• Automatic Summarization
• Question Answering
• Conversational Interfaces
Interdisciplinary Tasks: Speech-to-Text
Interdisciplinary Tasks: Image Captioning
What Makes NLP so Hard?
Ambiguity
Non-Standard Language
Also: neologisms, complex entity names, phrasal verbs/idioms
More Complex Languages Than English
• German: Donaudampfschiffahrtsgesellschaftskapitän (5 “words”)
• Chinese: 50,000 different characters (2-3k to read a newspaper)
• Japanese: 3 writing systems
• Thai: Ambiguous word boundaries and sentence concepts
• Slavic: Different word forms depending on gender, case, tense
Write Traditional “If-Then-Else” Rules?
BIG NOPE!
Leads to very large and complex codebases.
Still struggles to capture trivial cases (for a human).
Better Approach: Machine Learning
“ • A computer program is said to learn from experience E
• with respect to some class of tasks T and performance measure P,
• if its performance at tasks in T, as measured by P,
• improves with experience E.
— Tom M. Mitchell
Part 1
Essential Machine Learning Backgroundfor NLP
Before We Begin: Disclaimer
• This will be a very quick description of ML. By no means exhaustive.
• Only the essential background for what we’ll have in Part 2.
• To fit everything into a small timeframe, I’ll simplify some aspects.
• I encourage you to read ML books or watch videos to dig deeper.
Common ML Tasks
• Regression
• Classification (Binary or Multi-Class)
1. Supervised Learning
2. Unsupervised Learning
• Clustering
• Anomaly Detection
• Latent Variable Models (Dimensionality Reduction, EM, …)
Regression
Predict a continuous dependent variable
based on independent predictors
Linear Regression
After adding polynomial features
Classification
Assign an observation to some category
from a known discrete list of categories
Logistic Regression
Class A
Class B
(Multi-class extension = Softmax Regression)
Neural Networks
and Backpropagation Algorithm
http://playground.tensorflow.org/
Clustering
Group objects in such a way
that objects in the same group are similar,
and objects in the different groups are not
K-Means Clustering
Evaluation
How do we know if an ML model is good?
What do we do if something goes wrong?
Underfitting & Overfitting
Development & Troubleshooting
• Picking the right metric: MAE, RMSE, AUC, Cross-Entropy, Log-Loss
• Training Set / Validation Set / Test Set split
• Picking hyperparameters against Validation Set
• Regularization to prevent OF
• Plotting learning curves to check for UF/OF
Deep Learning
• Core idea: instead of hand-crafting complex features, use increased computing
capacity and build a deep computation graph that will try to learn feature
representations on its own.
End-to-end learning rather than a cascade of apps.
• Works best with lots of homogeneous, spatially related features
(image pixels, character sequences, audio signal measurements).
Usually works poorly otherwise.
• State-of-the-art and/or superhuman performance on many tasks.
• Typically requires massive amounts of data and training resources.
• But: a very young field. Theories not strongly established, views change.
Example: Convolutional Neural Network
Part 2
NLP Challenges And Approaches
“Classical” NLP Pipeline
Tokenization
Morphology
Syntax
Semantics
Discourse
Break text into sentences and words, lemmatize
Part of speech (POS) tagging, stemming, NER
Constituency/dependency parsing
Coreference resolution, wordsense disambiguation
Task-dependent (sentiment, …)
Often Relies on Language Banks
• WordNet (ontology, semantic similarity tree)
• Penn Treebank (POS, grammar rules)
• PropBank (semantic propositions)
• …Dozens of them!
Tokenization & Stemming
POS/NER Tagging
Parsing (LPCFG)
“Classical” way: Training a NER Tagger
Task: Predict whether the word is a PERSON, LOCATION, DATE or OTHER.
Could be more than 3 NER tags (e.g. MUC-7 contains 7 tags).
1. Current word.
2. Previous, next word (context).
3. POS tags of current word and nearby words.
4. NER label for previous word.
5. Word substrings (e.g. ends in “burg”, contains “oxa” etc.)
6. Word shape (internal capitalization, numerals, dashes etc.).
7. …on and on and on…
Features:
Feature Representation: Bag of Words
A single word is a one-hot encoding vector with the size of the dictionary :(
Problem
• Manually designed features are often over-specified, incomplete,
take a long time to design and validate.
• Often requires PhD-level knowledge of the domain.
• Researchers spend literally decades hand-crafting features.
• Bag of words model is very high-dimensional and sparse,
cannot capture semantics or morphology.
Maybe Deep Learning can help?
Deep Learning for NLP
• Core enabling idea: represent words as dense vectors
[0 1 0 0 0 0 0 0 0] [0.315 0.136 0.831]
• Try to capture semantic and morphologic similarity so that the features
for “similar” words are “similar”
(e.g. closer in Euclidean space).
• Natural language is context dependent: use context for learning.
• Straightforward (but slow) way: build a co-occurrence matrix and SVD it.
Embedding Methods: Word2Vec
CBoW version: predict center word from context Skip-gram version: predict context from center word
Benefits
• Learns features of each word on its own, given a text corpus.
• No heavy preprocessing is required, just a corpus.
• Word vectors can be used as features for lots of supervised
learning applications: POS, NER, chunking, semantic role labeling.
All with pretty much the same network architecture.
• Similarities and linear relationships between word vectors.
• A bit more modern representation: GloVe, but requires more RAM.
Linearities
Training a NER Tagger: Deep Learning
Just replace this with NER tag
(or POS tag, chunk end, etc.)
Language Modeling
Assign high probabilities to well-formed sentences
(crucial for text generation, speech recognition, machine translation)
“Classical” Way: N-Grams
Problem: doesn’t scale well to bigger N. N = 5 is pretty much the limit.
Deep Learning Way: Recurrent NN (RNN)
Can use past information without restricting the size of the context.
But: in practice, can’t recall information that came in a long time ago.
Long Short Term Memory Network (LSTM)
Contains gates that control forgetting, adding, updating and outputting information.
Surprisingly amazing performance at language tasks compared to vanilla RNN.
Tackling Hard Tasks
Deep Learning enables end-to-
end learning for Machine
Translation, Image Captioning,
Text Generation, Summarization:
NLP tasks which are inherently
very hard!
RNN for Machine Translation
Hottest Current Research
• Attention Networks
• Dynamic Memory Networks
(see ICML 2016 proceedings)
Tools I Used
• NLTK (Python)
• Gensim (Python)
• Stanford CoreNLP (Java with bindings)
• Apache OpenNLP (Java with bindings)
Deep Learning Frameworks with GPU Support:
• Torch (Torch-RNN) (Lua)
• TensorFlow, Theano, Keras (Python)
NLP Progress for Ukrainian
• Ukrainian lemma dictionary with POS tags
https://github.com/arysin/dict_uk
• Ukrainian lemmatizer plugin for ElasticSearch
https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer
• lang-uk project (1M corpus, NER, tokenization, etc.)
https://github.com/lang-uk
Demo 1: Exploring Semantic Properties Of ASOIAF(“Game of Thrones”)
Demo 2: TopicModeling for DOU.UA Comments
GitHub Repos with IPython Notebooks
• https://github.com/YuriyGuts/thrones2vec
• https://github.com/YuriyGuts/dou-topic-modeling
yuriy.guts@gmail.com
linkedin.com/in/yuriyguts
github.com/YuriyGuts

Weitere ähnliche Inhalte

Was ist angesagt?

Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processinggulshan kumar
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processingrohitnayak
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processingsaurabhnarhe
 
Natural language processing
Natural language processing Natural language processing
Natural language processing Md.Sumon Sarder
 
Natural Language Processing (NLP) - Introduction
Natural Language Processing (NLP) - IntroductionNatural Language Processing (NLP) - Introduction
Natural Language Processing (NLP) - IntroductionAritra Mukherjee
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)Kuppusamy P
 
Natural language processing
Natural language processingNatural language processing
Natural language processingprashantdahake
 
Natural language processing
Natural language processingNatural language processing
Natural language processingAbash shah
 
Natural language processing
Natural language processingNatural language processing
Natural language processingHansi Thenuwara
 
Natural language processing PPT presentation
Natural language processing PPT presentationNatural language processing PPT presentation
Natural language processing PPT presentationSai Mohith
 
Natural Language Processing in AI
Natural Language Processing in AINatural Language Processing in AI
Natural Language Processing in AISaurav Shrestha
 
Natural language processing
Natural language processingNatural language processing
Natural language processingKarenVacca
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 

Was ist angesagt? (20)

NLP
NLPNLP
NLP
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
NLP
NLPNLP
NLP
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural language processing
Natural language processing Natural language processing
Natural language processing
 
Natural Language Processing (NLP) - Introduction
Natural Language Processing (NLP) - IntroductionNatural Language Processing (NLP) - Introduction
Natural Language Processing (NLP) - Introduction
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural language processing PPT presentation
Natural language processing PPT presentationNatural language processing PPT presentation
Natural language processing PPT presentation
 
Natural Language Processing in AI
Natural Language Processing in AINatural Language Processing in AI
Natural Language Processing in AI
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 

Andere mochten auch

Building effective communication skills using NLP
Building effective communication skills using NLPBuilding effective communication skills using NLP
Building effective communication skills using NLPIIBA UK Chapter
 
Selling technique - With NLP
Selling technique - With NLPSelling technique - With NLP
Selling technique - With NLPPratibha Mishra
 
How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)Steven Hoober
 
What 33 Successful Entrepreneurs Learned From Failure
What 33 Successful Entrepreneurs Learned From FailureWhat 33 Successful Entrepreneurs Learned From Failure
What 33 Successful Entrepreneurs Learned From FailureReferralCandy
 
Upworthy: 10 Ways To Win The Internets
Upworthy: 10 Ways To Win The InternetsUpworthy: 10 Ways To Win The Internets
Upworthy: 10 Ways To Win The InternetsUpworthy
 
Five Killer Ways to Design The Same Slide
Five Killer Ways to Design The Same SlideFive Killer Ways to Design The Same Slide
Five Killer Ways to Design The Same SlideCrispy Presentations
 
A-Z Culture Glossary 2017
A-Z Culture Glossary 2017A-Z Culture Glossary 2017
A-Z Culture Glossary 2017sparks & honey
 
Digital Strategy 101
Digital Strategy 101Digital Strategy 101
Digital Strategy 101Bud Caddell
 
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)Board of Innovation
 
The What If Technique presented by Motivate Design
The What If Technique presented by Motivate DesignThe What If Technique presented by Motivate Design
The What If Technique presented by Motivate DesignMotivate Design
 
The Seven Deadly Social Media Sins
The Seven Deadly Social Media SinsThe Seven Deadly Social Media Sins
The Seven Deadly Social Media SinsXPLAIN
 
The History of SEO
The History of SEOThe History of SEO
The History of SEOHubSpot
 
What Would Steve Do? 10 Lessons from the World's Most Captivating Presenters
What Would Steve Do? 10 Lessons from the World's Most Captivating PresentersWhat Would Steve Do? 10 Lessons from the World's Most Captivating Presenters
What Would Steve Do? 10 Lessons from the World's Most Captivating PresentersHubSpot
 
10 Powerful Body Language Tips for your next Presentation
10 Powerful Body Language Tips for your next Presentation10 Powerful Body Language Tips for your next Presentation
10 Powerful Body Language Tips for your next PresentationSOAP Presentations
 

Andere mochten auch (16)

Building effective communication skills using NLP
Building effective communication skills using NLPBuilding effective communication skills using NLP
Building effective communication skills using NLP
 
Selling technique - With NLP
Selling technique - With NLPSelling technique - With NLP
Selling technique - With NLP
 
How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)
 
What 33 Successful Entrepreneurs Learned From Failure
What 33 Successful Entrepreneurs Learned From FailureWhat 33 Successful Entrepreneurs Learned From Failure
What 33 Successful Entrepreneurs Learned From Failure
 
Upworthy: 10 Ways To Win The Internets
Upworthy: 10 Ways To Win The InternetsUpworthy: 10 Ways To Win The Internets
Upworthy: 10 Ways To Win The Internets
 
Five Killer Ways to Design The Same Slide
Five Killer Ways to Design The Same SlideFive Killer Ways to Design The Same Slide
Five Killer Ways to Design The Same Slide
 
A-Z Culture Glossary 2017
A-Z Culture Glossary 2017A-Z Culture Glossary 2017
A-Z Culture Glossary 2017
 
Digital Strategy 101
Digital Strategy 101Digital Strategy 101
Digital Strategy 101
 
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)
How I got 2.5 Million views on Slideshare (by @nickdemey - Board of Innovation)
 
The What If Technique presented by Motivate Design
The What If Technique presented by Motivate DesignThe What If Technique presented by Motivate Design
The What If Technique presented by Motivate Design
 
The Seven Deadly Social Media Sins
The Seven Deadly Social Media SinsThe Seven Deadly Social Media Sins
The Seven Deadly Social Media Sins
 
The History of SEO
The History of SEOThe History of SEO
The History of SEO
 
Displaying Data
Displaying DataDisplaying Data
Displaying Data
 
What Would Steve Do? 10 Lessons from the World's Most Captivating Presenters
What Would Steve Do? 10 Lessons from the World's Most Captivating PresentersWhat Would Steve Do? 10 Lessons from the World's Most Captivating Presenters
What Would Steve Do? 10 Lessons from the World's Most Captivating Presenters
 
How Google Works
How Google WorksHow Google Works
How Google Works
 
10 Powerful Body Language Tips for your next Presentation
10 Powerful Body Language Tips for your next Presentation10 Powerful Body Language Tips for your next Presentation
10 Powerful Body Language Tips for your next Presentation
 

Ähnlich wie Natural Language Processing (NLP)

Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlpLaraOlmosCamarena
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Saurabh Kaushik
 
Machine Learning of Natural Language
Machine Learning of Natural LanguageMachine Learning of Natural Language
Machine Learning of Natural Languagebutest
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPMENGSAYLOEM1
 
An Introduction to Recent Advances in the Field of NLP
An Introduction to Recent Advances in the Field of NLPAn Introduction to Recent Advances in the Field of NLP
An Introduction to Recent Advances in the Field of NLPRrubaa Panchendrarajan
 
State-of-the-Art Text Classification using Deep Contextual Word Representations
State-of-the-Art Text Classification using Deep Contextual Word RepresentationsState-of-the-Art Text Classification using Deep Contextual Word Representations
State-of-the-Art Text Classification using Deep Contextual Word RepresentationsAusaf Ahmed
 
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0Plain Concepts
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPAnuj Gupta
 
Colloquium talk on modal sense classification using a convolutional neural ne...
Colloquium talk on modal sense classification using a convolutional neural ne...Colloquium talk on modal sense classification using a convolutional neural ne...
Colloquium talk on modal sense classification using a convolutional neural ne...Ana Marasović
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
 
Using construction grammar in conversational systems
Using construction grammar in conversational systemsUsing construction grammar in conversational systems
Using construction grammar in conversational systemsCJ Jenkins
 
Module 8: Natural language processing Pt 1
Module 8:  Natural language processing Pt 1Module 8:  Natural language processing Pt 1
Module 8: Natural language processing Pt 1Sara Hooker
 
Nlp research presentation
Nlp research presentationNlp research presentation
Nlp research presentationSurya Sg
 
Building a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From ScratchBuilding a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From ScratchNatasha Latysheva
 
Machine Learning in NLP
Machine Learning in NLPMachine Learning in NLP
Machine Learning in NLPVijay Ganti
 
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptx
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptxEXPLORING NATURAL LANGUAGE PROCESSING (1).pptx
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptxAtulKumarUpadhyay4
 

Ähnlich wie Natural Language Processing (NLP) (20)

Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlp
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
asdrfasdfasdf
asdrfasdfasdfasdrfasdfasdf
asdrfasdfasdf
 
Machine Learning of Natural Language
Machine Learning of Natural LanguageMachine Learning of Natural Language
Machine Learning of Natural Language
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
An Introduction to Recent Advances in the Field of NLP
An Introduction to Recent Advances in the Field of NLPAn Introduction to Recent Advances in the Field of NLP
An Introduction to Recent Advances in the Field of NLP
 
NLP Bootcamp
NLP BootcampNLP Bootcamp
NLP Bootcamp
 
State-of-the-Art Text Classification using Deep Contextual Word Representations
State-of-the-Art Text Classification using Deep Contextual Word RepresentationsState-of-the-Art Text Classification using Deep Contextual Word Representations
State-of-the-Art Text Classification using Deep Contextual Word Representations
 
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
 
Nltk
NltkNltk
Nltk
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLP
 
Colloquium talk on modal sense classification using a convolutional neural ne...
Colloquium talk on modal sense classification using a convolutional neural ne...Colloquium talk on modal sense classification using a convolutional neural ne...
Colloquium talk on modal sense classification using a convolutional neural ne...
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
Using construction grammar in conversational systems
Using construction grammar in conversational systemsUsing construction grammar in conversational systems
Using construction grammar in conversational systems
 
Module 8: Natural language processing Pt 1
Module 8:  Natural language processing Pt 1Module 8:  Natural language processing Pt 1
Module 8: Natural language processing Pt 1
 
Nlp research presentation
Nlp research presentationNlp research presentation
Nlp research presentation
 
Building a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From ScratchBuilding a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From Scratch
 
Machine Learning in NLP
Machine Learning in NLPMachine Learning in NLP
Machine Learning in NLP
 
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptx
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptxEXPLORING NATURAL LANGUAGE PROCESSING (1).pptx
EXPLORING NATURAL LANGUAGE PROCESSING (1).pptx
 

Mehr von Yuriy Guts

Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Yuriy Guts
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine LearningYuriy Guts
 
Target Leakage in Machine Learning
Target Leakage in Machine LearningTarget Leakage in Machine Learning
Target Leakage in Machine LearningYuriy Guts
 
Paraphrase Detection in NLP
Paraphrase Detection in NLPParaphrase Detection in NLP
Paraphrase Detection in NLPYuriy Guts
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2Yuriy Guts
 
NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)Yuriy Guts
 
Experiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivExperiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivYuriy Guts
 
A Developer Overview of Redis
A Developer Overview of RedisA Developer Overview of Redis
A Developer Overview of RedisYuriy Guts
 
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in ScalaYuriy Guts
 
Redis for .NET Developers
Redis for .NET DevelopersRedis for .NET Developers
Redis for .NET DevelopersYuriy Guts
 
Aspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETAspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETYuriy Guts
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional RequirementsYuriy Guts
 
Introduction to Software Architecture
Introduction to Software ArchitectureIntroduction to Software Architecture
Introduction to Software ArchitectureYuriy Guts
 
UML for Business Analysts
UML for Business AnalystsUML for Business Analysts
UML for Business AnalystsYuriy Guts
 
Intro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceIntro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceYuriy Guts
 
ELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseYuriy Guts
 
ELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesYuriy Guts
 
ELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingYuriy Guts
 
ELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryYuriy Guts
 

Mehr von Yuriy Guts (19)

Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
Target Leakage in Machine Learning
Target Leakage in Machine LearningTarget Leakage in Machine Learning
Target Leakage in Machine Learning
 
Paraphrase Detection in NLP
Paraphrase Detection in NLPParaphrase Detection in NLP
Paraphrase Detection in NLP
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2
 
NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)
 
Experiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivExperiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG Lviv
 
A Developer Overview of Redis
A Developer Overview of RedisA Developer Overview of Redis
A Developer Overview of Redis
 
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
 
Redis for .NET Developers
Redis for .NET DevelopersRedis for .NET Developers
Redis for .NET Developers
 
Aspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETAspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NET
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional Requirements
 
Introduction to Software Architecture
Introduction to Software ArchitectureIntroduction to Software Architecture
Introduction to Software Architecture
 
UML for Business Analysts
UML for Business AnalystsUML for Business Analysts
UML for Business Analysts
 
Intro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceIntro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT Audience
 
ELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash Course
 
ELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - Databases
 
ELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - Multithreading
 
ELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and Memory
 

Kürzlich hochgeladen

Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...kumargunjan9515
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...HyderabadDolls
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdfkhraisr
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...gajnagarg
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...kumargunjan9515
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 

Kürzlich hochgeladen (20)

Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 

Natural Language Processing (NLP)

  • 2. Who Is This Guy? Data Science Team Lead Sr. Data Scientist Software Architect, R&D Engineer I also teach Machine Learning:
  • 3. What is NLP? Study of interaction between computers and human languages NLP = Computer Science + AI + Computational Linguistics
  • 4. Common NLP Tasks Easy Medium Hard • Chunking • Part-of-Speech Tagging • Named Entity Recognition • Spam Detection • Thesaurus • Syntactic Parsing • Word Sense Disambiguation • Sentiment Analysis • Topic Modeling • Information Retrieval • Machine Translation • Text Generation • Automatic Summarization • Question Answering • Conversational Interfaces
  • 7. What Makes NLP so Hard?
  • 9. Non-Standard Language Also: neologisms, complex entity names, phrasal verbs/idioms
  • 10. More Complex Languages Than English • German: Donaudampfschiffahrtsgesellschaftskapitän (5 “words”) • Chinese: 50,000 different characters (2-3k to read a newspaper) • Japanese: 3 writing systems • Thai: Ambiguous word boundaries and sentence concepts • Slavic: Different word forms depending on gender, case, tense
  • 11. Write Traditional “If-Then-Else” Rules? BIG NOPE! Leads to very large and complex codebases. Still struggles to capture trivial cases (for a human).
  • 12. Better Approach: Machine Learning “ • A computer program is said to learn from experience E • with respect to some class of tasks T and performance measure P, • if its performance at tasks in T, as measured by P, • improves with experience E. — Tom M. Mitchell
  • 13. Part 1 Essential Machine Learning Backgroundfor NLP
  • 14. Before We Begin: Disclaimer • This will be a very quick description of ML. By no means exhaustive. • Only the essential background for what we’ll have in Part 2. • To fit everything into a small timeframe, I’ll simplify some aspects. • I encourage you to read ML books or watch videos to dig deeper.
  • 15. Common ML Tasks • Regression • Classification (Binary or Multi-Class) 1. Supervised Learning 2. Unsupervised Learning • Clustering • Anomaly Detection • Latent Variable Models (Dimensionality Reduction, EM, …)
  • 16.
  • 17.
  • 18. Regression Predict a continuous dependent variable based on independent predictors
  • 19.
  • 20.
  • 21.
  • 23.
  • 25.
  • 26. Classification Assign an observation to some category from a known discrete list of categories
  • 27. Logistic Regression Class A Class B (Multi-class extension = Softmax Regression)
  • 30. Clustering Group objects in such a way that objects in the same group are similar, and objects in the different groups are not
  • 32. Evaluation How do we know if an ML model is good? What do we do if something goes wrong?
  • 34. Development & Troubleshooting • Picking the right metric: MAE, RMSE, AUC, Cross-Entropy, Log-Loss • Training Set / Validation Set / Test Set split • Picking hyperparameters against Validation Set • Regularization to prevent OF • Plotting learning curves to check for UF/OF
  • 35. Deep Learning • Core idea: instead of hand-crafting complex features, use increased computing capacity and build a deep computation graph that will try to learn feature representations on its own. End-to-end learning rather than a cascade of apps. • Works best with lots of homogeneous, spatially related features (image pixels, character sequences, audio signal measurements). Usually works poorly otherwise. • State-of-the-art and/or superhuman performance on many tasks. • Typically requires massive amounts of data and training resources. • But: a very young field. Theories not strongly established, views change.
  • 37. Part 2 NLP Challenges And Approaches
  • 38. “Classical” NLP Pipeline Tokenization Morphology Syntax Semantics Discourse Break text into sentences and words, lemmatize Part of speech (POS) tagging, stemming, NER Constituency/dependency parsing Coreference resolution, wordsense disambiguation Task-dependent (sentiment, …)
  • 39. Often Relies on Language Banks • WordNet (ontology, semantic similarity tree) • Penn Treebank (POS, grammar rules) • PropBank (semantic propositions) • …Dozens of them!
  • 43. “Classical” way: Training a NER Tagger Task: Predict whether the word is a PERSON, LOCATION, DATE or OTHER. Could be more than 3 NER tags (e.g. MUC-7 contains 7 tags). 1. Current word. 2. Previous, next word (context). 3. POS tags of current word and nearby words. 4. NER label for previous word. 5. Word substrings (e.g. ends in “burg”, contains “oxa” etc.) 6. Word shape (internal capitalization, numerals, dashes etc.). 7. …on and on and on… Features:
  • 44. Feature Representation: Bag of Words A single word is a one-hot encoding vector with the size of the dictionary :(
  • 45. Problem • Manually designed features are often over-specified, incomplete, take a long time to design and validate. • Often requires PhD-level knowledge of the domain. • Researchers spend literally decades hand-crafting features. • Bag of words model is very high-dimensional and sparse, cannot capture semantics or morphology. Maybe Deep Learning can help?
  • 46. Deep Learning for NLP • Core enabling idea: represent words as dense vectors [0 1 0 0 0 0 0 0 0] [0.315 0.136 0.831] • Try to capture semantic and morphologic similarity so that the features for “similar” words are “similar” (e.g. closer in Euclidean space). • Natural language is context dependent: use context for learning. • Straightforward (but slow) way: build a co-occurrence matrix and SVD it.
  • 47. Embedding Methods: Word2Vec CBoW version: predict center word from context Skip-gram version: predict context from center word
  • 48. Benefits • Learns features of each word on its own, given a text corpus. • No heavy preprocessing is required, just a corpus. • Word vectors can be used as features for lots of supervised learning applications: POS, NER, chunking, semantic role labeling. All with pretty much the same network architecture. • Similarities and linear relationships between word vectors. • A bit more modern representation: GloVe, but requires more RAM.
  • 50. Training a NER Tagger: Deep Learning Just replace this with NER tag (or POS tag, chunk end, etc.)
  • 51. Language Modeling Assign high probabilities to well-formed sentences (crucial for text generation, speech recognition, machine translation)
  • 52. “Classical” Way: N-Grams Problem: doesn’t scale well to bigger N. N = 5 is pretty much the limit.
  • 53. Deep Learning Way: Recurrent NN (RNN) Can use past information without restricting the size of the context. But: in practice, can’t recall information that came in a long time ago.
  • 54. Long Short Term Memory Network (LSTM) Contains gates that control forgetting, adding, updating and outputting information. Surprisingly amazing performance at language tasks compared to vanilla RNN.
  • 55. Tackling Hard Tasks Deep Learning enables end-to- end learning for Machine Translation, Image Captioning, Text Generation, Summarization: NLP tasks which are inherently very hard! RNN for Machine Translation
  • 56. Hottest Current Research • Attention Networks • Dynamic Memory Networks (see ICML 2016 proceedings)
  • 57. Tools I Used • NLTK (Python) • Gensim (Python) • Stanford CoreNLP (Java with bindings) • Apache OpenNLP (Java with bindings) Deep Learning Frameworks with GPU Support: • Torch (Torch-RNN) (Lua) • TensorFlow, Theano, Keras (Python)
  • 58. NLP Progress for Ukrainian • Ukrainian lemma dictionary with POS tags https://github.com/arysin/dict_uk • Ukrainian lemmatizer plugin for ElasticSearch https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer • lang-uk project (1M corpus, NER, tokenization, etc.) https://github.com/lang-uk
  • 59. Demo 1: Exploring Semantic Properties Of ASOIAF(“Game of Thrones”) Demo 2: TopicModeling for DOU.UA Comments
  • 60. GitHub Repos with IPython Notebooks • https://github.com/YuriyGuts/thrones2vec • https://github.com/YuriyGuts/dou-topic-modeling