SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
Deep Learning
and Text Mining
Will Stanton
Ski Hackathon Kickoff Ceremony, Feb 28, 2015
We have a problem
● At Return Path, we process billions of emails
a year, from tons of senders
● We want to tag and cluster senders
○ Industry verticals (e-commerce, apparel, travel, etc.)
○ Type of customers they sell to (luxury, soccer moms,
etc.)
○ Business model (daily deals, flash sales, etc.)
● It’s too much to do by hand!
What to do?
● Standard approaches aren’t great
○ Bag of words classification model (document-term matrix, LSA, LDA)
■ Have to manually label lots of cases first
■ Difficult with lots of data (especially LDA)
○ Bag of words clustering
■ Can’t easily put one company into multiple categories (ie. more
general tagging)
■ Needs lots of tuning
● How about deep learning neural networks?
○ Very trendy. Let’s try it!
Neural Networks
Input
Layer First
Hidden
Layer
Second
Hidden
Layer
Output
Layer
Inputs
x = (x1, x2, x3)
Output
y
● Machine learning algorithms
modeled after the way the human
brain works
● Learn patterns and structure by
passing training data through
“neurons”
● Useful for classification,
regression, feature extraction, etc.
Deep Learning
● Neural networks with lots of hidden layers
(hundreds)
● State of the art for machine translation, facial
recognition, text classification, speech
recognition
○ Tasks with real deep structure, that humans do
automatically but computers struggle with
○ Should be good for company tagging!
Distributed Representations
Pixels
EdgesCollection of faces
Shapes
Typical facial types
(features)
● Human brain uses distributed representations
● We can use deep learning to do the same thing with
words (letters -> words -> phrases -> sentences -> …)
Deep Learning Challenges
● Computationally difficult to train (ie. slow)
○ Each hidden layer means more parameters
○ Each feature means more parameters
● Real human-generated text has a near-
infinite number of features and data
○ ie. slow would be a problem
● Solution: use word2vec
word2vec
● Published by scientists at Google in 2013
● Python implementation in 2014
○ gensim library
● Learns distributed vector representations
of words (“word to vec”) using a neural net
○ NOTE for hardcore experts: word2vec does not strictly or necessarily train a deep neural
net, but it uses deep learning technology (distributed representations, backpropagation,
stochastic gradient descent, etc.) and is based on a series of deep learning papers
What is the output?
● Distributed vector representations of words
○ each word is encoded as a vector of floats
○ vecqueen
= (0.2, -0.3, .7, 0, … , .3)
○ vecwoman
= (0.1, -0.2, .6, 0.1, … , .2)
○ length of the vectors = dimension of the word
representation
○ key concept of word2vec: words with similar
vectors have a similar meaning (context)
word2vec Features
● Very fast and scalable
○ Google trained it on 100’s of billions of words
● Uncovers deep latent structure of word
relationships
○ Can solve analogies like King::Man as Queen::? or
Paris::France as Berlin::?
○ Can solve “one of these things is not like another”
○ Can be used for machine translation or automated
sentence completion
How does it work?
● Feed the algorithm (lots of) sentences
○ totally unsupervised learning
● word2vec trains a neural net that encodes
the context of words within sentences
○ “Skip-grams”: what is the probability that the word
“queen” appears 1 word after “woman”, 2 words
after, etc.
word2vec at Return Path
● At Return Path, we implemented word2vec
on data from our Consumer Data Stream
○ billions of email subject lines from millions of users
○ fed 30 million unique subject lines (300m words) and
sending domains into word2vec (using Python)
Lots of
subject
lines
word2vec word vectors insights
Grouping companies with word2vec
● Find daily deals sites like Groupon
[word for (word, score) in model.most_similar('groupon.com', topn = 100) if
'.com' in word]
['grouponmail.com.au', 'specialicious.com', 'livingsocial.com', 'deem.com',
'hitthedeals.com', 'grabone-mail-ie.com', 'grabone-mail.com', 'kobonaty.com',
'deals.com.au', 'coupflip.com', 'ouffer.com', 'wagjag.com']
● Find apparel sites like Gap
[word for (word, score) in model.most_similar('gap.com', topn = 100) if '.com'
in word]
['modcloth.com', 'bananarepublic.com', 'shopjustice.com', 'thelimited.com',
'jcrew.com', 'gymboree.com', 'abercrombie-email.com', 'express.com',
'hollister-email.com', 'abercrombiekids-email.com', 'thredup.com',
'neimanmarcusemail.com']
More word2vec applications
● Find relationships between products
● model.most_similar(positive=['iphone', 'galaxy'], negative=['apple']) =
‘samsung’
● ie. iphone::apple as galaxy::? samsung!
● Distinguish different companies
● model.doesnt_match(['sheraton','westin','aloft','walmart']) = ‘walmart’
● ie. Wal Mart does not match Sheraton, Westin, and Aloft hotels
● Other possibilities
○ Find different companies with similar marketing copy
○ Automatically construct high-performing subject lines
○ Many more...
Try it yourself
● C implementation exists, but I recommend
Python
○ gensim library: https://radimrehurek.com/gensim/
○ tutorial:http://radimrehurek.
com/gensim/models/word2vec.html
○ webapp to try it out as part of tutorial
○ Pretrained Google News and Freebase models:
https://code.google.com/p/word2vec/
○ Only takes 10 lines of code to get started!
Thanks for listening!
● Many thanks to:
○ Data Science Association and Level 3
○ Michael Walker for organizing
● Slides posted on http://will-stanton.com/
● Email me at will@will-stanton.com
● Return Path is hiring! Voted #2 best
midsized company to work for in the country
http://careers.returnpath.com/

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (19)

Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
Information Retrieval with Deep Learning
Information Retrieval with Deep LearningInformation Retrieval with Deep Learning
Information Retrieval with Deep Learning
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)
 
Recurrent networks and beyond by Tomas Mikolov
Recurrent networks and beyond by Tomas MikolovRecurrent networks and beyond by Tomas Mikolov
Recurrent networks and beyond by Tomas Mikolov
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ers
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!
 
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结
 
Learning to understand phrases by embedding the dictionary
Learning to understand phrases by embedding the dictionaryLearning to understand phrases by embedding the dictionary
Learning to understand phrases by embedding the dictionary
 
Talk from NVidia Developer Connect
Talk from NVidia Developer ConnectTalk from NVidia Developer Connect
Talk from NVidia Developer Connect
 
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
 
Practical Deep Learning for NLP
Practical Deep Learning for NLP Practical Deep Learning for NLP
Practical Deep Learning for NLP
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Word embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMWord embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTM
 
Natural language processing techniques transition from machine learning to de...
Natural language processing techniques transition from machine learning to de...Natural language processing techniques transition from machine learning to de...
Natural language processing techniques transition from machine learning to de...
 

Andere mochten auch

NLP@Work Conference: email persuasion
NLP@Work Conference: email persuasionNLP@Work Conference: email persuasion
NLP@Work Conference: email persuasion
evolutionpd
 

Andere mochten auch (20)

Deep learning for text analytics
Deep learning for text analyticsDeep learning for text analytics
Deep learning for text analytics
 
"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.io
"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.io"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.io
"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.io
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Thinking about nlp
Thinking about nlpThinking about nlp
Thinking about nlp
 
NLP
NLPNLP
NLP
 
NLP@Work Conference: email persuasion
NLP@Work Conference: email persuasionNLP@Work Conference: email persuasion
NLP@Work Conference: email persuasion
 
Functional Reactive Programming in Clojurescript
Functional Reactive Programming in ClojurescriptFunctional Reactive Programming in Clojurescript
Functional Reactive Programming in Clojurescript
 
AI Reality: Where are we now? Data for Good? - Bill Boorman
AI Reality: Where are we now? Data for Good? - Bill  BoormanAI Reality: Where are we now? Data for Good? - Bill  Boorman
AI Reality: Where are we now? Data for Good? - Bill Boorman
 
Using Deep Learning And NLP To Predict Performance From Resumes
Using Deep Learning And NLP To Predict Performance From ResumesUsing Deep Learning And NLP To Predict Performance From Resumes
Using Deep Learning And NLP To Predict Performance From Resumes
 
Monads in Clojure
Monads in ClojureMonads in Clojure
Monads in Clojure
 
あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~
あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~
あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)
 
Natural Language Processing and Python
Natural Language Processing and PythonNatural Language Processing and Python
Natural Language Processing and Python
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
RでTwitterテキストマイニング
RでTwitterテキストマイニングRでTwitterテキストマイニング
RでTwitterテキストマイニング
 
RではじめるTwitter解析
RではじめるTwitter解析RではじめるTwitter解析
RではじめるTwitter解析
 
Online algorithms in Machine Learning
Online algorithms in Machine LearningOnline algorithms in Machine Learning
Online algorithms in Machine Learning
 
ElasticSearch for data mining
ElasticSearch for data mining ElasticSearch for data mining
ElasticSearch for data mining
 
SVM&R with Yaruo!!
SVM&R with Yaruo!!SVM&R with Yaruo!!
SVM&R with Yaruo!!
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 

Ähnlich wie Deep Learning and Text Mining

Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Chris Fregly
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015
lbishal
 

Ähnlich wie Deep Learning and Text Mining (20)

Word2 vec
Word2 vecWord2 vec
Word2 vec
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
 
Cloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptxCloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptx
 
data-science-pdf-16588.pdf
data-science-pdf-16588.pdfdata-science-pdf-16588.pdf
data-science-pdf-16588.pdf
 
[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptx
[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptx[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptx
[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptx
 
Cloud Study Jam[1st OCT] gdscgtbit.pptx
Cloud Study Jam[1st OCT] gdscgtbit.pptxCloud Study Jam[1st OCT] gdscgtbit.pptx
Cloud Study Jam[1st OCT] gdscgtbit.pptx
 
MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...
MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...
MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...
 
AWS re:Invent 2016: Deep Learning in Alexa (MAC202)
AWS re:Invent 2016: Deep Learning in Alexa (MAC202)AWS re:Invent 2016: Deep Learning in Alexa (MAC202)
AWS re:Invent 2016: Deep Learning in Alexa (MAC202)
 
Introducción a NLP (Natural Language Processing) en Azure
Introducción a NLP (Natural Language Processing) en AzureIntroducción a NLP (Natural Language Processing) en Azure
Introducción a NLP (Natural Language Processing) en Azure
 
National Wildlife Federation- OMS- Dreamcore 2011
National Wildlife Federation- OMS- Dreamcore 2011National Wildlife Federation- OMS- Dreamcore 2011
National Wildlife Federation- OMS- Dreamcore 2011
 
MachinaFiesta: A Vision into Machine Learning 🚀
MachinaFiesta: A Vision into Machine Learning 🚀MachinaFiesta: A Vision into Machine Learning 🚀
MachinaFiesta: A Vision into Machine Learning 🚀
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
Brownfield Domain Driven Design
Brownfield Domain Driven DesignBrownfield Domain Driven Design
Brownfield Domain Driven Design
 
MongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDB
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015
 
Knowledge graphs + Chatbots with Neo4j
Knowledge graphs + Chatbots with Neo4jKnowledge graphs + Chatbots with Neo4j
Knowledge graphs + Chatbots with Neo4j
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015
 
[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用
 
Labeling all the Things with the WDI Skill Labeler
Labeling all the Things with the WDI Skill Labeler Labeling all the Things with the WDI Skill Labeler
Labeling all the Things with the WDI Skill Labeler
 
Building an An AI Startup with MongoDB at x.ai
Building an An AI Startup with MongoDB at x.aiBuilding an An AI Startup with MongoDB at x.ai
Building an An AI Startup with MongoDB at x.ai
 

Kürzlich hochgeladen

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Kürzlich hochgeladen (20)

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 

Deep Learning and Text Mining

  • 1. Deep Learning and Text Mining Will Stanton Ski Hackathon Kickoff Ceremony, Feb 28, 2015
  • 2. We have a problem ● At Return Path, we process billions of emails a year, from tons of senders ● We want to tag and cluster senders ○ Industry verticals (e-commerce, apparel, travel, etc.) ○ Type of customers they sell to (luxury, soccer moms, etc.) ○ Business model (daily deals, flash sales, etc.) ● It’s too much to do by hand!
  • 3. What to do? ● Standard approaches aren’t great ○ Bag of words classification model (document-term matrix, LSA, LDA) ■ Have to manually label lots of cases first ■ Difficult with lots of data (especially LDA) ○ Bag of words clustering ■ Can’t easily put one company into multiple categories (ie. more general tagging) ■ Needs lots of tuning ● How about deep learning neural networks? ○ Very trendy. Let’s try it!
  • 4. Neural Networks Input Layer First Hidden Layer Second Hidden Layer Output Layer Inputs x = (x1, x2, x3) Output y ● Machine learning algorithms modeled after the way the human brain works ● Learn patterns and structure by passing training data through “neurons” ● Useful for classification, regression, feature extraction, etc.
  • 5. Deep Learning ● Neural networks with lots of hidden layers (hundreds) ● State of the art for machine translation, facial recognition, text classification, speech recognition ○ Tasks with real deep structure, that humans do automatically but computers struggle with ○ Should be good for company tagging!
  • 6. Distributed Representations Pixels EdgesCollection of faces Shapes Typical facial types (features) ● Human brain uses distributed representations ● We can use deep learning to do the same thing with words (letters -> words -> phrases -> sentences -> …)
  • 7. Deep Learning Challenges ● Computationally difficult to train (ie. slow) ○ Each hidden layer means more parameters ○ Each feature means more parameters ● Real human-generated text has a near- infinite number of features and data ○ ie. slow would be a problem ● Solution: use word2vec
  • 8. word2vec ● Published by scientists at Google in 2013 ● Python implementation in 2014 ○ gensim library ● Learns distributed vector representations of words (“word to vec”) using a neural net ○ NOTE for hardcore experts: word2vec does not strictly or necessarily train a deep neural net, but it uses deep learning technology (distributed representations, backpropagation, stochastic gradient descent, etc.) and is based on a series of deep learning papers
  • 9. What is the output? ● Distributed vector representations of words ○ each word is encoded as a vector of floats ○ vecqueen = (0.2, -0.3, .7, 0, … , .3) ○ vecwoman = (0.1, -0.2, .6, 0.1, … , .2) ○ length of the vectors = dimension of the word representation ○ key concept of word2vec: words with similar vectors have a similar meaning (context)
  • 10. word2vec Features ● Very fast and scalable ○ Google trained it on 100’s of billions of words ● Uncovers deep latent structure of word relationships ○ Can solve analogies like King::Man as Queen::? or Paris::France as Berlin::? ○ Can solve “one of these things is not like another” ○ Can be used for machine translation or automated sentence completion
  • 11. How does it work? ● Feed the algorithm (lots of) sentences ○ totally unsupervised learning ● word2vec trains a neural net that encodes the context of words within sentences ○ “Skip-grams”: what is the probability that the word “queen” appears 1 word after “woman”, 2 words after, etc.
  • 12. word2vec at Return Path ● At Return Path, we implemented word2vec on data from our Consumer Data Stream ○ billions of email subject lines from millions of users ○ fed 30 million unique subject lines (300m words) and sending domains into word2vec (using Python) Lots of subject lines word2vec word vectors insights
  • 13. Grouping companies with word2vec ● Find daily deals sites like Groupon [word for (word, score) in model.most_similar('groupon.com', topn = 100) if '.com' in word] ['grouponmail.com.au', 'specialicious.com', 'livingsocial.com', 'deem.com', 'hitthedeals.com', 'grabone-mail-ie.com', 'grabone-mail.com', 'kobonaty.com', 'deals.com.au', 'coupflip.com', 'ouffer.com', 'wagjag.com'] ● Find apparel sites like Gap [word for (word, score) in model.most_similar('gap.com', topn = 100) if '.com' in word] ['modcloth.com', 'bananarepublic.com', 'shopjustice.com', 'thelimited.com', 'jcrew.com', 'gymboree.com', 'abercrombie-email.com', 'express.com', 'hollister-email.com', 'abercrombiekids-email.com', 'thredup.com', 'neimanmarcusemail.com']
  • 14. More word2vec applications ● Find relationships between products ● model.most_similar(positive=['iphone', 'galaxy'], negative=['apple']) = ‘samsung’ ● ie. iphone::apple as galaxy::? samsung! ● Distinguish different companies ● model.doesnt_match(['sheraton','westin','aloft','walmart']) = ‘walmart’ ● ie. Wal Mart does not match Sheraton, Westin, and Aloft hotels ● Other possibilities ○ Find different companies with similar marketing copy ○ Automatically construct high-performing subject lines ○ Many more...
  • 15. Try it yourself ● C implementation exists, but I recommend Python ○ gensim library: https://radimrehurek.com/gensim/ ○ tutorial:http://radimrehurek. com/gensim/models/word2vec.html ○ webapp to try it out as part of tutorial ○ Pretrained Google News and Freebase models: https://code.google.com/p/word2vec/ ○ Only takes 10 lines of code to get started!
  • 16. Thanks for listening! ● Many thanks to: ○ Data Science Association and Level 3 ○ Michael Walker for organizing ● Slides posted on http://will-stanton.com/ ● Email me at will@will-stanton.com ● Return Path is hiring! Voted #2 best midsized company to work for in the country http://careers.returnpath.com/