2. 2
Rebecca Bilbro
Lead Data Scientist, Bytecubed
Organizer, Data Science DC
Faculty, Georgetown Univ.
& District Data Labs
rebecca.bilbro@bytecubed.com
github.com/rebeccabilbro
twitter.com/rebeccabilbro
3. 3
Main take-aways
NLP is...
• different from numerical ML (but also the same).
• not about beautiful, bespoke algorithms.
• hard and messy work.
• necessary.
4. 4
Overview
• Everyday NLP
• Language aware applications
• Nuts and bolts of NLP
• Open source tools
• Questions
6. 6
Everyday NLP Problems
• Summarization
• Reference Resolution
• Machine Translation
• Language Generation
• Language Understanding
• Document Classification
• Author Identification
• Part of Speech Tagging
• Question Answering
• Information Extraction
• Information Retrieval
• Speech Recognition
• Sense Disambiguation
• Topic Recognition
• Relationship Detection
• Named Entity Recognition
7. 7
mostly solved
• Spam detection: Let’s go to Agra! ✓ / Buy V1AGRA … ✗
• Part-of-speech (POS) tagging: Colorless/ADJ green/ADJ ideas/NOUN sleep/VERB furiously/ADV
• Named entity recognition (NER): Einstein/PERSON met with UN/ORG officials in Princeton/LOC
making good progress
• Sentiment analysis: Best roast chicken in San Francisco! / The waiter ignored us for 20 minutes.
• Coreference resolution: Carter told Mubarak he shouldn’t run again.
• Word sense disambiguation (WSD): I need new batteries for my mouse.
• Parsing: I can see Alcatraz from the window!
• Machine translation (MT): 第13届上海国际电影节开幕… => The 13th Shanghai International Film Festival…
• Information extraction (IE): You’re invited to our dinner party, Friday May 27 at 8:30 => Party, May 27 (add)
still really hard
• Question answering (QA): Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?
• Paraphrase: XYZ acquired ABC yesterday / ABC has been taken over by XYZ
• Summarization: The Dow Jones is up / The S&P500 jumped / Housing prices rose => Economy is good
• Dialog: Where is Citizen Kane playing in SF? / Castro Theatre at 7:30. Do you want a ticket?
Dan Jurafsky
10. 10
How to build a data product
Data Ingestion
Data Wrangling
Computational Data Store
WORM Store
Data Exploration
Feature Analysis
Model Storage
Model Fitting
Model Evaluation and Selection
Application
Feedback
11. 11
How to build a language aware application
Data Ingestion
Wrangling
Preprocessing
WORM Store
Analytics
Corpus Reader
Preprocessing
Corpus Reader
Raw Corpus
Tokenized Corpus
Text Vectorization
Model Fitting
Model Store
Application
Lexical Resources
Feedback
12. 12
Language Aware Applications
• Not automagic
• Take in text data as input
• Parse into composite parts
• Compute upon composites
• Derive model
• Introduce new data
• Predict
• Deliver result
• Ingest feedback
• Do it all again (but better!)
20. 20
Parse
from nltk import sent_tokenize

def sents(self, fileids=None, categories=None):
    """
    Uses built-in NLTK sentence tokenizer to extract sentences from
    paragraphs.
    """
    for paragraph in self.paras(fileids, categories):
        for sentence in sent_tokenize(paragraph):
            yield sentence
21. 21
Tokenize
from nltk import wordpunct_tokenize

def words(self, fileids=None, categories=None):
    """
    Uses built-in NLTK word tokenizer to extract tokens from
    sentences.
    """
    for sentence in self.sents(fileids, categories):
        for token in wordpunct_tokenize(sentence):
            yield token
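The paragraph, sentence, and word generators above can be tried without NLTK installed. What follows is a minimal sketch, not the speaker's corpus reader: `TinyCorpus` is a hypothetical in-memory class, and the two regular expressions are rough stand-ins for `sent_tokenize` and `wordpunct_tokenize` that will mis-split abbreviations and other hard cases.

```python
import re

# Stand-ins for NLTK's tokenizers (assumptions, not NLTK's implementations).
SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+")  # naive split after ., ! or ?
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")      # runs of word chars or of punctuation


class TinyCorpus:
    """Hypothetical in-memory corpus: each document is a list of paragraphs."""

    def __init__(self, docs):
        self.docs = docs  # dict mapping fileid -> list of paragraph strings

    def paras(self, fileids=None):
        for fileid in (fileids or self.docs):
            for paragraph in self.docs[fileid]:
                yield paragraph

    def sents(self, fileids=None):
        for paragraph in self.paras(fileids):
            for sentence in SENT_BOUNDARY.split(paragraph):
                if sentence:
                    yield sentence

    def words(self, fileids=None):
        for sentence in self.sents(fileids):
            for token in WORD_PUNCT.findall(sentence):
                yield token


corpus = TinyCorpus({
    "a.txt": ["Bats can see via echolocation. See the bat sight sneeze!"],
})
sentences = list(corpus.sents())
tokens = list(corpus.words())
```

Because each level is a generator built on the one below it, nothing is tokenized until a caller iterates, which matters once the corpus no longer fits in memory.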
22. 22
Tag
from nltk import pos_tag, sent_tokenize, wordpunct_tokenize

def tokenize(self, fileids=None, categories=None):
    """
    Segments, tokenizes, and tags a document in the corpus.
    """
    # Yields one list per paragraph; each holds one list of (token, tag) pairs per sentence.
    for paragraph in self.corpus.paras(fileids=fileids, categories=categories):
        yield [
            pos_tag(wordpunct_tokenize(sent))
            for sent in sent_tokenize(paragraph)
        ]
24. 24
Normalization
import nltk
import string

def tokenize(text):
    stem = nltk.stem.SnowballStemmer('english')
    text = text.lower()

    for token in nltk.word_tokenize(text):
        if token in string.punctuation:
            continue
        yield stem.stem(token)

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]
26. 26
Vectorization
The elephant sneezed at the sight of potatoes.
Bats can see via echolocation. See the bat sight sneeze!
Wondering, she opened the door to the studio.
at
bat
can
door
echolocation
elephant
of
open
potato
see
she
sight
sneeze
studio
the
to
via
wonder
Multiple Options!
Bag-of-words · One-hot encoding · TFIDF · Distributed representation
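To make the first three options concrete, here is a minimal sketch in plain Python over the same three-sentence corpus. It is an illustration under stated assumptions, not the talk's code: the `tokenize` helper here only lowercases and keeps alphabetic tokens (no stemming, so the vocabulary keeps "bats" and "bat" apart), "one-hot" is taken to mean a presence/absence vector per document, and the TF-IDF weighting shown is just one common variant. Distributed representations need a trained model and are not covered.

```python
import math
import re
from collections import Counter

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]


def tokenize(text):
    """Lowercase and keep alphabetic tokens only (no stemming in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())


docs = [tokenize(text) for text in corpus]
vocab = sorted(set(token for doc in docs for token in doc))


def bag_of_words(doc):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def one_hot(doc):
    """Mark presence (1) or absence (0) of each vocabulary term."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


def tfidf(doc):
    """Term frequency times log(N / document frequency), one common variant."""
    counts = Counter(doc)
    n_docs = len(docs)
    vector = []
    for term in vocab:
        df = sum(1 for d in docs if term in d)
        vector.append(counts[term] * math.log(n_docs / df))
    return vector


bow = bag_of_words(docs[1])
hot = one_hot(docs[1])
weights = tfidf(docs[1])
```

In the second document "see" occurs twice, so its count is 2 while its one-hot value is 1. "the" appears in all three documents, so its TF-IDF weight under this variant is zero.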
28. 28
Machine Learning on Text
Data Management Layer
Raw Data
Feature Engineering
Hyperparameter Tuning
Algorithm Selection
Model Selection Triples
Instance Database
Model Storage
Model Family
Model Form
30. 30
Putting the pieces together
Data Loader
Text Normalization
Text Vectorization
Feature Transformation
Estimator
Data Loader
Feature Union
Estimator
Text Normalization
Document Features
Text Extraction
Summary Vectorization
Article Vectorization
Concept Features
Metadata Features
Dict Vectorizer
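The simpler of the two diagrams (loader, normalization, vectorization, estimator) can be sketched as a chain of objects that share a fit/transform interface. This is a plain-Python illustration of the idea, not scikit-learn's API: `Normalizer`, `BagOfWords`, `NearestCentroid`, and `Pipeline` are hypothetical classes written for this example, the two training sentences are borrowed from the sentiment example earlier in the deck, and their labels are assumptions. The feature-union branch is not shown.

```python
import re
from collections import Counter


class Normalizer:
    """Transformer: lowercases text and splits it into alphabetic tokens."""

    def fit(self, docs, labels=None):
        return self

    def transform(self, docs):
        return [re.findall(r"[a-z]+", doc.lower()) for doc in docs]


class BagOfWords:
    """Transformer: learns a vocabulary, then maps token lists to count vectors."""

    def fit(self, docs, labels=None):
        self.vocab = sorted({token for doc in docs for token in doc})
        return self

    def transform(self, docs):
        return [[Counter(doc)[term] for term in self.vocab] for doc in docs]


class NearestCentroid:
    """Estimator: predicts the label whose mean training vector is closest."""

    def fit(self, vectors, labels):
        grouped = {}
        for vector, label in zip(vectors, labels):
            grouped.setdefault(label, []).append(vector)
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, vectors):
        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return [
            min(self.centroids, key=lambda label: distance(v, self.centroids[label]))
            for v in vectors
        ]


class Pipeline:
    """Chains transformers and ends with an estimator."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, docs, labels):
        for step in self.transformers:
            docs = step.fit(docs, labels).transform(docs)
        self.estimator.fit(docs, labels)
        return self

    def predict(self, docs):
        for step in self.transformers:
            docs = step.transform(docs)
        return self.estimator.predict(docs)


train = [
    "Best roast chicken in San Francisco!",
    "The waiter ignored us for 20 minutes.",
]
labels = ["positive", "negative"]  # assumed labels, for illustration only

model = Pipeline([Normalizer(), BagOfWords()], NearestCentroid()).fit(train, labels)
prediction = model.predict(["The waiter ignored us again."])
```

Keeping every step behind the same interface means the vocabulary learned during fitting is reused unchanged at prediction time, and any single step can be swapped without touching the others.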
33. 33
Open Source Tools
For ingestion
Requests, BeautifulSoup => Baleen
For preprocessing and normalization
NLTK => Minke
For machine learning
scikit-learn, Gensim, spaCy => Yellowbrick
34. 34
Main take-aways
NLP is...
• different from numerical ML (but also the same).
• not about beautiful, bespoke algorithms.
• hard and messy work.
• necessary.