1
Natural Language Processing
for Everyday People
Dr. Rebecca Bilbro,
Lead Data Scientist
2
Rebecca Bilbro
Lead Data Scientist, Bytecubed
Organizer, Data Science DC
Faculty, Georgetown Univ.
& District Data Labs
rebecca.bilbro@bytecubed.com
github.com/rebeccabilbro
twitter.com/rebeccabilbro
3
Main take-aways
NLP is...
• different from numerical ML (but also the same).
• not about beautiful, bespoke algorithms.
• hard and messy work.
• necessary.
4
Overview
• Everyday NLP
• Language aware applications
• Nuts and bolts of NLP
• Open source tools
• Questions
5
Everyday NLP
6
Everyday NLP Problems
• Summarization
• Reference Resolution
• Machine Translation
• Language Generation
• Language Understanding
• Document Classification
• Author Identification
• Part of Speech Tagging
• Question Answering
• Information Extraction
• Information Retrieval
• Speech Recognition
• Sense Disambiguation
• Topic Recognition
• Relationship Detection
• Named Entity Recognition
7
mostly solved
• Spam detection: "Let’s go to Agra!" ✓ / "Buy V1AGRA …" ✗
• Part-of-speech (POS) tagging: "Colorless green ideas sleep furiously." (ADJ ADJ NOUN VERB ADV)
• Named entity recognition (NER): "Einstein met with UN officials in Princeton" (PERSON, ORG, LOC)
making good progress
• Sentiment analysis: "Best roast chicken in San Francisco!" / "The waiter ignored us for 20 minutes."
• Coreference resolution: "Carter told Mubarak he shouldn’t run again."
• Word sense disambiguation (WSD): "I need new batteries for my mouse."
• Parsing: "I can see Alcatraz from the window!"
• Machine translation (MT): "第13届上海国际电影节开幕…" => "The 13th Shanghai International Film Festival…"
• Information extraction (IE): "You’re invited to our dinner party, Friday May 27 at 8:30" => Party, May 27 (add)
still really hard
• Question answering (QA): "Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"
• Paraphrase: "XYZ acquired ABC yesterday" / "ABC has been taken over by XYZ"
• Summarization: "The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" => "Economy is good"
• Dialog: "Where is Citizen Kane playing in SF?" / "Castro Theatre at 7:30. Do you want a ticket?"
Credit: Dan Jurafsky
8
9
Language aware applications
10
How to build a data product
• Data Ingestion
• Data Wrangling
• Computational Data Store
• WORM Store
• Data Exploration
• Feature Analysis
• Model Storage
• Model Fitting
• Model Evaluation and Selection
• Application
• Feedback
11
How to build a language aware application
• Data Ingestion
• Wrangling
• Preprocessing
• WORM Store
• Analytics
• Corpus Reader
• Raw Corpus
• Tokenized Corpus
• Text Vectorization
• Model Fitting
• Model Store
• Application
• Lexical Resources
• Feedback
12
Language Aware Applications
• Not automagic
• Take in text data as input
• Parse into composite parts
• Compute upon composites
• Derive model
• Introduce new data
• Predict
• Deliver result
• Ingest feedback
• Do it all again (but better!)
13
Language Aware Applications
Challenges:
• Ingestion
• Messy data
• Language Ambiguity
• High dimensional feature space
• Computation speed
• Cross-validation
• Pipelines
14
Language Aware Applications
Requirements:
• Robust data management
• Domain-specific corpora
• Normalization
• Vectorization
• Streaming
• Dimensionality reduction
• Visualization
• Repeatability
15
Nuts and bolts of NLP
16
Nuts and Bolts
• Ingestion
• Parsing
• Tokenization
• Normalization
• Vectorization
• Classification
• Clustering
• Pipelines
18
High Level
• Data Ingestion
• WORM Storage (raw XML/HTML/JSON/CSV)
• Pre-processed CorpusReader
• Preprocessing Transformer
• Tokenized Corpus
• Post-processed CorpusReader
20
Parse
from nltk import sent_tokenize

# Method of a corpus reader class that already provides paras().
def sents(self, fileids=None, categories=None):
    """
    Uses the built-in NLTK sentence tokenizer to extract sentences from
    paragraphs.
    """
    for paragraph in self.paras(fileids, categories):
        for sentence in sent_tokenize(paragraph):
            yield sentence
21
Tokenize
from nltk import wordpunct_tokenize

# Method of the same corpus reader class; builds on sents().
def words(self, fileids=None, categories=None):
    """
    Uses the built-in NLTK word tokenizer to extract tokens from
    sentences.
    """
    for sentence in self.sents(fileids, categories):
        for token in wordpunct_tokenize(sentence):
            yield token
22
Tag
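from nltk import pos_tag, sent_tokenize, wordpunct_tokenize

# Method of a preprocessor that wraps a corpus reader as self.corpus.
def tokenize(self, fileids=None, categories=None):
    """
    Segments, tokenizes, and tags a document in the corpus.
    Yields one list per paragraph; each item is a sentence given as a
    list of (token, tag) pairs.
    """
    for paragraph in self.corpus.paras(fileids=fileids, categories=categories):
        yield [
            pos_tag(wordpunct_tokenize(sent))
            for sent in sent_tokenize(paragraph)
        ]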
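The reader methods on the Parse, Tokenize, and Tag slides are generators that feed one another: paragraphs => sentences => tokens. Below is a minimal sketch of that chain, assuming a hypothetical in-memory TinyCorpusReader (not part of NLTK or of the original talk). Simple regular expressions stand in for NLTK's sent_tokenize and wordpunct_tokenize so that the example runs without any downloads; the splitting is deliberately naive.

```python
import re


class TinyCorpusReader:
    """Hypothetical in-memory stand-in for a corpus reader (illustration only)."""

    def __init__(self, documents):
        # documents: dict mapping a file id to its raw text
        self.documents = documents

    def paras(self, fileids=None):
        """Yield paragraphs, split on blank lines."""
        for fileid in fileids or sorted(self.documents):
            for para in re.split(r"\n\s*\n", self.documents[fileid]):
                if para.strip():
                    yield para.strip()

    def sents(self, fileids=None):
        """Yield sentences, naively split after '.', '!' or '?'."""
        for para in self.paras(fileids):
            for sent in re.split(r"(?<=[.!?])\s+", para):
                if sent:
                    yield sent

    def words(self, fileids=None):
        """Yield word tokens and punctuation tokens separately."""
        for sent in self.sents(fileids):
            for token in re.findall(r"\w+|[^\w\s]+", sent):
                yield token


text = (
    "Bats can see via echolocation. See the bat sight sneeze!"
    "\n\n"
    "Wondering, she opened the door."
)
reader = TinyCorpusReader({"a.txt": text})
sentences = list(reader.sents())
tokens = list(reader.words())
```

Because every level is a generator, nothing is read or split until a consumer asks for the next item, which is what lets this pattern stream over corpora that do not fit in memory.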
24
Normalization
import nltk
import string

def tokenize(text):
    # Lowercase, drop punctuation-only tokens, and stem what remains.
    stem = nltk.stem.SnowballStemmer('english')
    text = text.lower()

    for token in nltk.word_tokenize(text):
        if token in string.punctuation:
            continue
        yield stem.stem(token)

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]
26
Vectorization
Corpus:
• The elephant sneezed at the sight of potatoes.
• Bats can see via echolocation. See the bat sight sneeze!
• Wondering, she opened the door to the studio.
Vocabulary: at, bat, can, door, echolocation, elephant, of, open, potato, see, she, sight, sneeze, studio, the, to, via, wonder
Multiple Options!
Bag-of-words · One-hot encoding · TF-IDF · Distributed representation
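Three of the options named on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch, not the talk's implementation: simple_tokens is a hypothetical stand-in normalizer that only lowercases and keeps alphabetic runs (no stemming, so "bat" and "bats" stay separate terms, unlike the vocabulary on the slide), and the TF-IDF weight is the plain, unsmoothed count × log(N / document frequency). Libraries typically use smoothed variants, and distributed representations are not shown.

```python
import math
import re
from collections import Counter

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]


def simple_tokens(text):
    """Hypothetical stand-in normalizer: lowercase, alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


docs = [simple_tokens(doc) for doc in corpus]
# One vector position per distinct term, in alphabetical order.
vocab = sorted({token for doc in docs for token in doc})


def bag_of_words(tokens):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def one_hot(tokens):
    """Record only whether each vocabulary term occurs at all."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


def tfidf(tokens):
    """Weight counts by log(N / df), so terms found in every document score 0."""
    counts = Counter(tokens)
    n_docs = len(docs)
    vector = []
    for term in vocab:
        df = sum(1 for doc in docs if term in doc)
        vector.append(counts[term] * math.log(n_docs / df))
    return vector


bow = [bag_of_words(doc) for doc in docs]
hot = [one_hot(doc) for doc in docs]
weighted = [tfidf(doc) for doc in docs]
```

The three encodings share one vocabulary and differ only in the value stored per term: a count, a presence flag, or a count discounted by how common the term is across the corpus. "the" occurs in all three documents, so its TF-IDF weight is zero everywhere.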
28
Machine Learning on Text
• Data Management Layer
• Raw Data
• Feature Engineering
• Hyperparameter Tuning
• Algorithm Selection
• Model Selection Triples
• Instance Database
• Model Storage
• Model Family
• Model Form
30
Putting the pieces together
• Simple pipeline: Data Loader => Text Normalization => Text Vectorization => Feature Transformation => Estimator
• Pipeline with a feature union: Data Loader => Feature Union => Estimator
• Feature Union components: Text Normalization, Document Features, Text Extraction, Summary Vectorization, Article Vectorization, Concept Features, Metadata Features, Dict Vectorizer
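The simple pipeline on this slide (normalize => vectorize => estimate) can be sketched without any libraries. Everything below is hypothetical and for illustration only: the Normalizer, CountVectorizerSketch, NearestCentroid, and Pipeline classes, and the four toy training documents. The classes mirror the fit/transform/predict convention that scikit-learn pipelines follow, but this is not scikit-learn's implementation, and the feature-union branch is not shown.

```python
import re
from collections import Counter


class Normalizer:
    """Stand-in for text normalization: lowercase and tokenize."""

    def fit(self, docs, labels=None):
        return self

    def transform(self, docs):
        return [re.findall(r"[a-z]+", doc.lower()) for doc in docs]


class CountVectorizerSketch:
    """Learn a vocabulary, then map token lists to count vectors."""

    def fit(self, docs, labels=None):
        self.vocab = sorted({tok for doc in docs for tok in doc})
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            counts = Counter(doc)
            rows.append([counts[term] for term in self.vocab])
        return rows


class NearestCentroid:
    """Toy estimator: predict the label whose mean vector is closest."""

    def fit(self, rows, labels):
        self.centroids = {}
        for label in set(labels):
            members = [r for r, l in zip(rows, labels) if l == label]
            self.centroids[label] = [sum(col) / len(members) for col in zip(*members)]
        return self

    def predict(self, rows):
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [
            min(self.centroids, key=lambda lab: dist(row, self.centroids[lab]))
            for row in rows
        ]


class Pipeline:
    """Chain transformers and finish with an estimator."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, docs, labels):
        data = docs
        for step in self.steps[:-1]:
            data = step.fit(data, labels).transform(data)
        self.steps[-1].fit(data, labels)
        return self

    def predict(self, docs):
        data = docs
        for step in self.steps[:-1]:
            data = step.transform(data)
        return self.steps[-1].predict(data)


# Invented toy data, only to exercise the pipeline.
train_docs = [
    "Win a free prize now",
    "Free prize, claim now",
    "Let's go to Agra",
    "We go to Agra tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = Pipeline([Normalizer(), CountVectorizerSketch(), NearestCentroid()])
model.fit(train_docs, train_labels)
predictions = model.predict(["free prize", "go to Agra"])
```

Keeping every step behind the same small interface is what makes the whole chain repeatable: new text passes through exactly the transformations that were fitted on the training data.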
31
Open Source Tools
(in Python)
32
Open Source Tools
Feature NLTK Scikit-Learn Gensim Pattern SpaCy
Ingestion tools ✓ ✓
Classifiers ✓ ✓ ✓
Topic Modeling ✓ ✓ ✓
Vectorization ✓ ✓ ✓
Tokenization ✓ ✓ ✓ ✓ ✓
Parsing ✓ ✓ ✓
TF-IDF ✓ ✓ ✓
Pipelines ✓
33
Open Source Tools
For ingestion
Requests, BeautifulSoup => Baleen
For preprocessing and normalization
NLTK => Minke
For machine learning
Scikit-Learn, Gensim, SpaCy => Yellowbrick
34
Main take-aways
NLP is...
• different from numerical ML (but also the same).
• not about beautiful, bespoke algorithms.
• hard and messy work.
• necessary.
35
Thank you!
Contact Info:
Rebecca Bilbro
rebecca.bilbro@bytecubed.com
github.com/rebeccabilbro
twitter.com/rebeccabilbro

Powerful Google developer tools for immediate impact! (2023-24 C)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

NLP for Everyday People

  • 1. Natural Language Processing for Everyday People. Dr. Rebecca Bilbro, Lead Data Scientist
  • 2. Rebecca Bilbro: Lead Data Scientist, Bytecubed; Organizer, Data Science DC; Faculty, Georgetown Univ. & District Data Labs. rebecca.bilbro@bytecubed.com, github.com/rebeccabilbro, twitter.com/rebeccabilbro
  • 3. Main take-aways. NLP is... • different from numerical ML (but also the same). • not about beautiful, bespoke algorithms. • hard and messy work. • necessary.
  • 4. Overview • Everyday NLP • Language aware applications • Nuts and bolts of NLP • Open source tools • Questions
  • 6. Everyday NLP Problems • Summarization • Reference Resolution • Machine Translation • Language Generation • Language Understanding • Document Classification • Author Identification • Part of Speech Tagging • Question Answering • Information Extraction • Information Retrieval • Speech Recognition • Sense Disambiguation • Topic Recognition • Relationship Detection • Named Entity Recognition
  • 7. Slide credit: Dan Jurafsky
      mostly solved
        • Spam detection: "Let’s go to Agra!" ✓ / "Buy V1AGRA …" ✗
        • Part-of-speech (POS) tagging: "Colorless green ideas sleep furiously." → ADJ ADJ NOUN VERB ADV
        • Named entity recognition (NER): "Einstein met with UN officials in Princeton" → PERSON, ORG, LOC
      making good progress
        • Sentiment analysis: "Best roast chicken in San Francisco!" / "The waiter ignored us for 20 minutes."
        • Coreference resolution: "Carter told Mubarak he shouldn’t run again."
        • Word sense disambiguation (WSD): "I need new batteries for my mouse."
        • Parsing: "I can see Alcatraz from the window!"
        • Machine translation (MT): "第13届上海国际电影节开幕…" → "The 13th Shanghai International Film Festival…"
        • Information extraction (IE): "You’re invited to our dinner party, Friday May 27 at 8:30" → Party, May 27, add
      still really hard
        • Question answering (QA): "Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"
        • Paraphrase: "XYZ acquired ABC yesterday" / "ABC has been taken over by XYZ"
        • Summarization: "The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" → "Economy is good"
        • Dialog: "Where is Citizen Kane playing in SF?" "Castro Theatre at 7:30. Do you want a ticket?"
  • 8.
  • 10. How to build a data product: Data Ingestion Data Wrangling Computational Data Store WORM Store Data Exploration Feature Analysis Model Storage Model Fitting Model Evaluation and Selection Application Feedback
  • 11. How to build a language aware application: Data Ingestion Wrangling Preprocessing WORM Store Analytics Corpus Reader Preprocessing Corpus Reader Raw Corpus Tokenized Corpus Text Vectorization Model Fitting Model Store Application Lexical Resources Feedback
  • 12. Language Aware Applications • Not automagic • Take in text data as input • Parse into composite parts • Compute upon composites • Derive model • Introduce new data • Predict • Deliver result • Ingest feedback • Do it all again (but better!)
  • 13. Language Aware Applications. Challenges: • Ingestion • Messy data • Language Ambiguity • High dimensional feature space • Computation speed • Cross-validation • Pipelines
  • 14. Language Aware Applications. Requirements: • Robust data management • Domain-specific corpora • Normalization • Vectorization • Streaming • Dimensionality reduction • Visualization • Repeatability
  • 16. Nuts and Bolts • Ingestion • Parsing • Tokenization • Normalization • Vectorization • Classification • Clustering • Pipelines
  • 17. Nuts and Bolts • Ingestion • Parsing • Tokenization • Normalization • Vectorization • Classification • Clustering • Pipelines
  • 18. High Level Data Ingestion Preprocessing Transformer Pre-processed CorpusReader WORM Storage (raw XML/HTML/JSON/CSV) Tokenized Corpus Post-processed CorpusReader
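The "WORM Storage" box on slide 18 (write once, read many) can be read as: raw documents are saved exactly as fetched and never modified afterwards. Below is a minimal sketch of that idea using only the Python standard library. The function name `store_raw` and the choice of content-hash file names are assumptions made for illustration; the talk does not show how its storage layer is implemented.

```python
import hashlib
import tempfile
from pathlib import Path


def store_raw(root, payload, extension="html"):
    """Write a raw document once and return its path (hypothetical helper).

    The file name is the SHA-256 digest of the content, so ingesting the
    same bytes again maps to the same path and nothing is overwritten.
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{hashlib.sha256(payload).hexdigest()}.{extension}"
    if not path.exists():
        path.write_bytes(payload)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        page = b"<html><body>Hello corpus</body></html>"
        first = store_raw(tmp, page)
        second = store_raw(tmp, page)  # duplicate ingest is a no-op
        print(first == second)
```

Later preprocessing steps would then read from this directory and write their tokenized output somewhere else, leaving the raw copy untouched.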
  • 19. Nuts and Bolts • Ingestion • Parsing • Tokenization • Normalization • Vectorization • Classification • Clustering • Pipelines
  • 20. Parse

        from nltk import sent_tokenize

        def sents(self, fileids=None, categories=None):
            """
            Uses built-in NLTK sentence tokenizer to extract sentences from
            paragraphs.
            """
            for paragraph in self.paras(fileids, categories):
                for sentence in sent_tokenize(paragraph):
                    yield sentence

  • 21. Tokenize

        from nltk import wordpunct_tokenize

        def words(self, fileids=None, categories=None):
            """
            Use built-in NLTK word tokenizer to extract tokens from sentences.
            """
            for sentence in self.sents(fileids, categories):
                for token in wordpunct_tokenize(sentence):
                    yield token
  • 22. Tag

        from nltk import pos_tag, sent_tokenize, wordpunct_tokenize

        def tokenize(self, fileids=None, categories=None):
            """
            Segments, tokenizes, and tags a document in the corpus.
            """
            # Yields one list per paragraph; each item is a sentence given
            # as a list of (token, tag) pairs.
            for paragraph in self.corpus.paras(fileids=fileids, categories=categories):
                yield [
                    pos_tag(wordpunct_tokenize(sent))
                    for sent in sent_tokenize(paragraph)
                ]
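Slides 20 to 22 chain generators: paragraphs feed sentences, sentences feed tokens. The sketch below shows the same chaining pattern without NLTK or its data downloads. The class name `TinyCorpus` is hypothetical, and the regular expressions are crude stand-ins (an assumption for illustration) for `sent_tokenize` and `wordpunct_tokenize`; they will mis-split abbreviations such as "Dr." and are not a substitute for the real tokenizers.

```python
import re


class TinyCorpus:
    """Hypothetical in-memory corpus: a list of documents (plain strings)."""

    def __init__(self, documents):
        self.documents = documents

    def paras(self):
        # Paragraphs are separated by one or more blank lines.
        for doc in self.documents:
            for para in re.split(r"\n\s*\n", doc):
                if para.strip():
                    yield para.strip()

    def sents(self):
        # Crude stand-in for sent_tokenize: split after ., ! or ?
        for para in self.paras():
            for sent in re.split(r"(?<=[.!?])\s+", para):
                if sent:
                    yield sent

    def words(self):
        # Runs of word characters, or runs of punctuation.
        for sent in self.sents():
            for token in re.findall(r"\w+|[^\w\s]+", sent):
                yield token


if __name__ == "__main__":
    corpus = TinyCorpus([
        "Bats can see via echolocation. See the bat sight sneeze!\n\n"
        "Wondering, she opened the door to the studio."
    ])
    for sentence in corpus.sents():
        print(sentence)
```

Because every level is a generator, nothing is held in memory beyond the paragraph currently being processed, which matters once the corpus no longer fits in RAM.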
  • 23. Nuts and Bolts • Ingestion • Parsing • Tokenization • Normalization • Vectorization • Classification • Clustering • Pipelines
  • 24. Normalization

        import nltk
        import string

        def tokenize(text):
            stem = nltk.stem.SnowballStemmer('english')
            text = text.lower()

            for token in nltk.word_tokenize(text):
                if token in string.punctuation:
                    continue
                yield stem.stem(token)

        corpus = [
            "The elephant sneezed at the sight of potatoes.",
            "Bats can see via echolocation. See the bat sight sneeze!",
            "Wondering, she opened the door to the studio.",
        ]
  • 25. Nuts and Bolts • Ingestion • Parsing • Tokenization • Normalization • Vectorization • Classification • Clustering • Pipelines
  • 26. Vectorization
      Corpus: "The elephant sneezed at the sight of potatoes." / "Bats can see via echolocation. See the bat sight sneeze!" / "Wondering, she opened the door to the studio."
      Vocabulary: at, bat, can, door, echolocation, elephant, of, open, potato, see, she, sight, sneeze, studio, the, to, via, wonder
      Multiple Options! Bag-of-words · One-hot encoding · TFIDF · Distributed representation
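Three of the options named on slide 26 (bag-of-words counts, one-hot encoding, TF-IDF) can be sketched with the standard library alone. The token lists below are hand-normalized so that they reproduce the 18-term vocabulary on the slide; that normalization is an assumption, not the output of the stemmer on slide 24. The TF-IDF variant here is the plain textbook form (term frequency times `log(N / df)`); libraries usually apply smoothing, so their numbers will differ.

```python
import math
from collections import Counter

# Hand-normalized tokens for the three slide sentences (assumption: lemmas
# chosen to match the vocabulary listed on the slide).
DOCS = [
    "the elephant sneeze at the sight of potato".split(),
    "bat can see via echolocation see the bat sight sneeze".split(),
    "wonder she open the door to the studio".split(),
]


def build_vocabulary(docs):
    """Sorted list of every distinct token in the corpus."""
    return sorted({token for doc in docs for token in doc})


def frequency_vector(tokens, vocab):
    """Bag-of-words: how many times each vocabulary term occurs."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def one_hot_vector(tokens, vocab):
    """One-hot: 1 if the term occurs at all, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


def tfidf_vectors(docs, vocab):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vectors


if __name__ == "__main__":
    vocab = build_vocabulary(DOCS)
    print(vocab)
    print(frequency_vector(DOCS[1], vocab))
    print(one_hot_vector(DOCS[1], vocab))
```

Note that "the" appears in all three documents, so its unsmoothed TF-IDF weight is zero everywhere, while a term such as "elephant" that occurs in only one document keeps a positive weight.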
  • 27. Nuts and Bolts • Ingestion • Parsing • Tokenization • Normalization • Vectorization • Classification • Clustering • Pipelines
  • 28. Machine Learning on Text: Data Management Layer Raw Data Feature Engineering Hyperparameter Tuning Algorithm Selection Model Selection Triples Instance Database Model Storage Model Family Model Form
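Slide 28 names three choices that together define a fitted model: the features, the algorithm (model family), and the hyperparameters. As one concrete instance, the sketch below uses token lists as features, multinomial Naive Bayes as the family, and the smoothing constant `alpha` as the hyperparameter. Naive Bayes is chosen here only because it fits in a few lines; the talk does not say which classifier it uses, and the toy training data is invented for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # must be positive to avoid log(0)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                if token in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((counts[token] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data, loosely echoing the spam example on slide 7.
TRAIN_DOCS = [
    ["buy", "cheap", "pills", "now"],
    ["cheap", "pills", "buy"],
    ["let", "us", "go", "to", "agra"],
    ["dinner", "party", "on", "friday"],
]
TRAIN_LABELS = ["spam", "spam", "ham", "ham"]

if __name__ == "__main__":
    model = NaiveBayes(alpha=1.0).fit(TRAIN_DOCS, TRAIN_LABELS)
    print(model.predict(["buy", "pills"]))
    print(model.predict(["go", "to", "dinner"]))
```

Model selection in the sense of the slide would mean refitting with different feature sets, different families, and different values of `alpha`, then comparing them on held-out data.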
  • 29. Nuts and Bolts • Ingestion • Parsing • Tokenization • Normalization • Vectorization • Classification • Clustering • Pipelines
  • 30. Putting the pieces together: Data Loader Text Normalization Text Vectorization Feature Transformation Estimator Data Loader Feature Union Estimator Text Normalization Document Features Text Extraction Summary Vectorization Article Vectorization Concept Features Metadata Features Dict Vectorizer
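The first chain on slide 30 (normalization, then vectorization, then an estimator) can be sketched as a tiny pipeline in which every step exposes `fit` and `transform`, loosely following the convention scikit-learn uses. All class names below are hypothetical and written from scratch with the standard library; the dot-product nearest-neighbor estimator is an assumption picked for brevity, not the estimator from the talk.

```python
import string
from collections import Counter


class Normalizer:
    """Lowercase, strip punctuation, split on whitespace."""

    def fit(self, docs, labels=None):
        return self

    def transform(self, docs):
        table = str.maketrans("", "", string.punctuation)
        return [doc.lower().translate(table).split() for doc in docs]


class BagOfWords:
    """Learn a vocabulary in fit; emit token-count vectors in transform."""

    def fit(self, docs, labels=None):
        self.vocab = sorted({token for doc in docs for token in doc})
        return self

    def transform(self, docs):
        vectors = []
        for doc in docs:
            counts = Counter(doc)
            vectors.append([counts[term] for term in self.vocab])
        return vectors


class NearestNeighbor:
    """Predict the label of the training vector with the largest dot product."""

    def fit(self, vectors, labels):
        self.vectors, self.labels = vectors, labels
        return self

    def predict(self, vectors):
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))

        predictions = []
        for v in vectors:
            best = max(range(len(self.vectors)), key=lambda i: dot(v, self.vectors[i]))
            predictions.append(self.labels[best])
        return predictions


class Pipeline:
    """Chain transformers and finish with an estimator."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def _run(self, docs, labels=None, fit=False):
        for step in self.transformers:
            if fit:
                step.fit(docs, labels)
            docs = step.transform(docs)
        return docs

    def fit(self, docs, labels):
        self.estimator.fit(self._run(docs, labels, fit=True), labels)
        return self

    def predict(self, docs):
        return self.estimator.predict(self._run(docs))


if __name__ == "__main__":
    corpus = [
        "The elephant sneezed at the sight of potatoes.",
        "Bats can see via echolocation. See the bat sight sneeze!",
        "Wondering, she opened the door to the studio.",
    ]
    model = Pipeline([Normalizer(), BagOfWords()], NearestNeighbor())
    model.fit(corpus, ["elephant", "bat", "door"])
    print(model.predict(["Echolocation bats", "Potatoes!"]))
```

The point of the wrapper is repeatability: new text passes through exactly the same normalization and vocabulary that were learned at fit time, so training and prediction cannot drift apart.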
  • 32. Open Source Tools. Feature: NLTK, Scikit-Learn, Gensim, Pattern, SpaCy. Ingestion tools ✓ ✓; Classifiers ✓ ✓ ✓; Topic Modeling ✓ ✓ ✓; Vectorization ✓ ✓ ✓; Tokenization ✓ ✓ ✓ ✓ ✓; Parsing ✓ ✓ ✓; TF-IDF ✓ ✓ ✓; Pipelines ✓
  • 33. Open Source Tools. For ingestion: Requests, BeautifulSoup => Baleen. For preprocessing and normalization: NLTK => Minke. For machine learning: Scikit-Learn, Gensim, Spacy => Yellowbrick.
  • 34. Main take-aways. NLP is... • different from numerical ML (but also the same). • not about beautiful, bespoke algorithms. • hard and messy work. • necessary.
  • 35. Thank you! Contact Info: Rebecca Bilbro, rebecca.bilbro@bytecubed.com, github.com/rebeccabilbro, twitter.com/rebeccabilbro