2. Who am I?
Chris Fregly, Research Scientist @ PipelineIO, San Francisco
Previously, Engineer @ Netflix, Databricks, and IBM Spark
Contributor @ Apache Spark, Committer @ Netflix OSS
Founder @ Advanced Spark and TensorFlow Meetup
Author @ Advanced Spark (advancedspark.com)
7. Confession #1
I Failed Linguistics in College!
Chose Pass/Fail Option
(90 (mid-term) + 70 (final)) / 2 = 80 = C+
How did a C+ turn into an F?
ZERO (0) CLASS PARTICIPATION?!
8. Confession #2
I Hated Statistics in College
2 Degrees: Mechanical + Manufacturing Engineering
Approximations were Bad!
I Wasn’t a Fluffy Physics Major
Though, I Kinda Wish I Was!
9. Wait… Please Don’t Leave!
I’m Older and Wiser Now
Approximate is the New Exact
Computational Linguistics and NLP are My Jam!
11. What is TensorFlow?
General Purpose Numerical Computation Engine
Happens to be good for neural nets!
Tooling
TensorBoard (port 6006 == `goog`)
DAG-based like Spark!
Computation graph is logical plan
Stored as protobufs
TF converts logical -> physical plan
Lots of Libraries
TFLearn (TensorFlow's scikit-learn-style API)
TensorFlow Serving (prediction layer)
Distributed and GPU-Optimized
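A minimal Python sketch (not from the talk) of the point above: the computation graph is the logical plan, stored as a protobuf (GraphDef), and the session run is the physical execution. It assumes TensorFlow is installed and uses the TF 1.x-style graph API via tf.compat.v1.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()   # graph mode, as in TF 1.x

a = tf.constant(3.0, name="a")
b = tf.constant(4.0, name="b")
c = a * b + 2.0                          # ops are added to the default graph (logical plan)

with tf.compat.v1.Session() as sess:
    print(sess.run(c))                               # physical execution: 14.0
    print(len(sess.graph.as_graph_def().node))       # the graph is serialized as a protobuf
```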
12. What are Neural Networks?
Like All ML, Goal is to Minimize Loss (Error)
Error relative to known outcome of labeled data
Mostly Supervised Learning Classification
Labeled training data
Training Steps
Step 1: Randomly Guess Input Weights
Step 2: Calculate Error Against Labeled Data
Step 3: Determine Gradient Value, +/- Direction
Step 4: Back-propagate Gradient to Update Each Input Weight
Step 5: Repeat Steps 2-4 with the New Weights until Convergence (see the NumPy sketch below)
(Diagram: activation function)
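A tiny NumPy sketch of those steps on toy labeled data; a single linear "neuron" with mean-squared error stands in for a full network, purely for illustration.

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])      # toy labeled data
y = np.array([[1.0], [3.0], [5.0], [7.0]])      # targets follow y = 2x + 1

rng = np.random.default_rng(0)
w, b = rng.normal(), rng.normal()                # Step 1: randomly guess input weights

for step in range(500):
    y_hat = w * X + b                            # forward pass
    err = y_hat - y                              # Step 2: error against labeled data
    grad_w = 2 * np.mean(err * X)                # Step 3: gradient value and +/- direction
    grad_b = 2 * np.mean(err)
    w -= 0.05 * grad_w                           # Step 4: update each input weight
    b -= 0.05 * grad_b                           # Step 5: repeat until convergence

print(round(w, 2), round(b, 2))                  # ~2.0, ~1.0
```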
13. Activation Functions
Goal: Learn and Train a Model on Input Data
Non-Linear Functions
Find Non-Linear Fit of Input Data
Common Activation Functions
Sigmoid Function (sigmoid)
(0, 1)
Hyperbolic Tangent (tanh)
(-1, 1)
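A quick NumPy sketch of the two activation functions and their output ranges:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes values into (-1, 1)

x = np.linspace(-5, 5, 5)
print(sigmoid(x))   # approaches 0 and 1 at the extremes
print(tanh(x))      # approaches -1 and 1 at the extremes
```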
17. Convolutional Neural Networks
Feed-forward
Do not form a cycle
Apply Many Layers (aka. Filters) to Input
Each Layer/Filter Picks up on Features
Features not necessarily human-grokkable
Examples of Human-grokkable Filters
3 color filters: RGB
Moving average for time series
Brute Force
Try different numbers of layers and layer sizes
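A hedged tf.keras sketch of stacking convolutional layers/filters; the layer counts and filter sizes are arbitrary placeholders, exactly the kind of knobs you end up brute-forcing.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                  # e.g. grayscale images
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu"),  # first layer of filters
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),  # deeper filters pick up features
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),    # classification head
])
model.summary()
```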
18. CNN Use Case: Stitch Fix
Stitch Fix Also Uses NLP to Analyze Return/Reject Comments
Stitch Fix, Strata Conf SF 2016:
Using Deep Learning to Create New Clothing Styles!
19. Recurrent Neural Networks
Forms a Cycle (vs. Feed-forward)
Maintains State over Time
Keep track of context
Learns sequential patterns
Decay over time
Use Cases
Speech
Text/NLP Prediction
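A minimal NumPy sketch of one vanilla RNN step: the hidden state is the "memory" that carries context from one time step to the next (weights here are random, purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(4, 8))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(8, 8))   # hidden -> hidden (the cycle)
b_h  = np.zeros(8)

def rnn_step(x_t, h_prev):
    # new state depends on the current input AND the previous state
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

h = np.zeros(8)
for x_t in rng.normal(size=(5, 4)):          # a sequence of 5 inputs
    h = rnn_step(x_t, h)                     # state maintained over time
print(h.shape)                               # (8,)
```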
20. RNN Sequences
Input: Image
Output: Classification
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Input: Image
Output: Text (Captions)
Input: Text
Output: Class (Sentiment)
Input: Text (English)
Output: Text (Spanish)
(Diagram: input layer → hidden layer → output layer)
21. Character-based RNNs
Tokens are Characters vs. Words/Phrases
Microsoft trains every 3 characters
Less Combination of Possible Neighbors
Only 26 alpha character tokens vs. millions of word tokens
Preserving state between the 1st and 2nd 'l' improves prediction
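A trivial Python sketch of character tokens vs. word tokens; note how small the character vocabulary is compared to a word vocabulary.

```python
text = "hello world"

char_tokens = list(text)            # ['h', 'e', 'l', 'l', 'o', ' ', 'w', ...]
word_tokens = text.split()          # ['hello', 'world']

vocab = sorted(set(char_tokens))    # tiny character vocabulary vs. millions of words
print(len(vocab), vocab)
```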
22. Long Short Term Memory (LSTM)
More complex state update function than a vanilla RNN
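A small tf.keras sketch contrasting the two cells: same interface, but the LSTM's state update adds input/forget/output gates and a separate cell state (shapes and sizes below are arbitrary).

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 10, 4).astype("float32")   # (batch, time steps, features)

vanilla = tf.keras.layers.SimpleRNN(8)           # single tanh state update
lstm    = tf.keras.layers.LSTM(8)                # gated state update + cell state

print(vanilla(x).shape, lstm(x).shape)           # both (1, 8)
```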
26. Use Cases
Document Summary
TextRank: TF/IDF + PageRank
Article Classification and Similarity
LDA: calculate top `k` topic distribution
Machine Translation
word2vec: compare word embedding vectors
Must Convert Text to Numbers!
27. Core Concepts
Corpus
Collection of text
e.g. documents, articles, genetic codes
Embeddings
Tokens represented/embedded in vector space
Learned, hidden features (~PCA, SVD)
Similar tokens cluster together; analogies are preserved as vector offsets
k-skip-gram
Skip up to k neighboring tokens when forming grams
n-gram
Treat n consecutive tokens as a single token
Composable: 1-skip bi-gram (every other word)
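A toy Python sketch of n-grams and k-skip-grams; the helper functions are illustrative, not from any library.

```python
tokens = "the quick brown fox jumps".split()

def ngrams(toks, n):
    # n consecutive tokens treated as a single token
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def skip_bigrams(toks, k):
    # pair each token with neighbors up to k positions further away
    return [(toks[i], toks[j])
            for i in range(len(toks))
            for j in range(i + 1, min(i + k + 2, len(toks)))]

print(ngrams(tokens, 2))          # bi-grams: ('the', 'quick'), ('quick', 'brown'), ...
print(skip_bigrams(tokens, 1))    # 1-skip bi-grams also include ('the', 'brown'), i.e. every other word
```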
28. Parsers and POS Taggers
Describe grammatical sentence structure
Requires context of entire sentence
Helps reason about sentence
~80% of cases are obvious from simple neighboring tokens
Major bottleneck in NLP pipeline!
29. Pre-trained Parsers and Taggers
Penn Treebank
Parser and Part-of-Speech Tagger
Human-annotated (!)
Trained on 4.5 million words
Parsey McParseface
Trained by SyntaxNet
30. Feature Engineering
Lower-case
Preserve proper nouns using caret (`^`)
“MLconf” => “^m^lconf”
“Varsity” => “^varsity”
Encode Common N-grams (Phrases)
Create a single token using underscore (`_`)
“Senior Developer” => “senior_developer”
Stemming and Lemmatization
Try to avoid: let the neural network figure this out
Can preserve part of speech (POS) using “_noun”, “_verb”
“banking” => “banking_verb”
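A rough Python sketch of the case and phrase encodings above; the helper names are made up for illustration.

```python
import re

def encode_case(text):
    # lower-case, but preserve proper nouns by prefixing each capital with '^'
    return re.sub(r"[A-Z]", lambda m: "^" + m.group(0).lower(), text)

def encode_phrases(text, phrases):
    # join known multi-word phrases into single tokens with '_'
    for p in phrases:
        text = text.replace(p, p.replace(" ", "_"))
    return text

print(encode_case("MLconf"))                                      # ^m^lconf
print(encode_phrases("senior developer", ["senior developer"]))   # senior_developer
```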
32. Count-based Models
Goal: Convert Text to Vector of Neighbor Co-occurrences
Bag of Words (BOW)
Simple hashmap with word counts
Loses neighbor context
Term Frequency / Inverse Document Frequency (TF/IDF)
Normalizes counts by how common a token is across documents
GloVe
Matrix factorization on co-occurrence matrix
Highly parallelizable, reduce dimensions, capture global co-occurrence stats
Log smoothing of probability ratios
Stores word vector diffs for fast analogy lookups
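A short scikit-learn sketch (assumes scikit-learn is installed) of the two simplest count-based representations, bag of words and TF/IDF:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

bow = CountVectorizer().fit_transform(docs)      # raw word counts (no neighbor context)
tfidf = TfidfVectorizer().fit_transform(docs)    # counts normalized by document frequency

print(bow.toarray())
print(tfidf.toarray().round(2))
```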
33. Neural-based Predictive Models
Goal: Predict Text using Learned Embedding Vectors
word2vec
Shallow neural network
Local: nearby words predict each other
Fixed word embedding vector size (e.g. 300)
Optimizer: Mini-batch Stochastic Gradient Descent (SGD)
SyntaxNet
Deep(er) neural network
Global(er)
Not a Recurrent Neural Net (RNN)!
Can combine with BOW-based models (e.g. word2vec CBOW)
34. word2vec
CBOW word2vec
Predict target word from source context
A single source context is an observation
Loses useful distribution information
Good for small datasets
Skip-gram word2vec (Inverse of CBOW)
Predict source context words from target word
Each (target word, context word) pair is an observation
Better for large datasets
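A small Python sketch of how the two flavors frame training observations for one toy sentence, using a context window of 1 on each side:

```python
sentence = "the quick brown fox".split()
window = 1

cbow_obs, skipgram_obs = [], []
for i, target in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_obs.append((context, target))                  # CBOW: whole context predicts target
    skipgram_obs += [(target, c) for c in context]      # skip-gram: target predicts each context word

print(cbow_obs)       # e.g. (['the', 'brown'], 'quick')
print(skipgram_obs)   # e.g. ('quick', 'the'), ('quick', 'brown'), ...
```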
36. *2vec
lda2vec
LDA (global) + word2vec (local)
From Chris Moody @ Stitch Fix
like2vec
Embedding-based Recommender
37. word2vec vs. GloVe
Both are Fundamentally Similar
Capture local co-occurrence statistics (neighbors)
Capture distances between embedding vectors (analogies)
GloVe
Count-based
Also captures global co-occurrence statistics
Requires upfront pass through entire dataset
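A toy NumPy sketch of the analogy arithmetic both approaches support; the 3-d vectors below are made up purely for illustration.

```python
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman ~= queen
query = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], query))
print(best)   # queen
```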
38. SyntaxNet POS Tagging
Determine coarse-grained grammatical role of each word
Multiple contexts, multiple roles
Neural Net
Inputs: stack, buffer
Output: POS probability distribution
(Diagram: stack tokens already tagged)
39. SyntaxNet Dependency Parser
Determine fine-grained roles using grammatical relationships
“Transition-based”, Incremental Dependency Parser
Globally Normalized using Beam Search with Early Update
Parsey McParseface: Pre-trained Parser/Tagger available in 40 languages
(Diagram: fine-grained vs. coarse-grained roles)
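Not SyntaxNet itself, but a quick stand-in illustration of coarse-grained POS tagging using NLTK's pre-trained Penn Treebank-style tagger (assumes nltk and its tagger models are installed):

```python
import nltk

# May require one-time downloads:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("I saw her duck under the table")
print(nltk.pos_tag(tokens))   # list of (token, Penn Treebank POS tag) pairs
# "duck" can be tagged as a noun or a verb depending on context:
# multiple contexts, multiple roles
```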
40. SyntaxNet Use Case: Nutrition
Nutrition and Health Startup in SF (Stealth)
Using Google’s SyntaxNet
Rate Recipes and Menus by Nutritional Value
(Examples: correct vs. incorrect parses)
42. Thank You, Atlanta!
Chris Fregly, Research Scientist @ PipelineIO
All Source Code, Demos, and Docker Images
@ pipeline.io
Join the Global Meetup for all Slides and Videos
@ advancedspark.com