Word2Vec
Recall: Creating Numerical Features from Text
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one.']
cv = CountVectorizer()
X = cv.fit_transform(corpus)
# get_feature_names_out() on scikit-learn >= 1.0; use get_feature_names() on older versions
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
Code Output
Count Vectorizer
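For reference, the DataFrame produced by the code above (one row per document, one column per vocabulary word):

   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         1      0   1    0       1    1      0     1
2    1         0      0   0    1       0    1      1     0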
Word/Document Vectors with CountVectorizer
Document Vectors
• Doc 0: [0, 1, 1, 1, 0, 0, 1, 0, 1]
• Doc 1: [0, 1, 0, 1, 0, 1, 1, 0, 1]
• Doc 2: [1, 0, 0, 0, 1, 0, 1, 1, 0]
Flip it around → Word Vectors
• and: [0, 0, 1]
• document: [1, 1, 0]
• first: [1, 0, 0]
• is: [1, 1, 0]
• one: [0, 0, 1]
• second: [0, 1, 0]
• the: [1, 1, 1]
• third: [0, 0, 1]
• this: [1, 1, 0]
Why Word Vectors?
• Represent conceptual “meaning” of words
• Word vectors close to each other have similar meaning
How to Use Word Vectors?
• Information Retrieval
▪ e.g. conceptual search queries, concepts related to “painting”
• Document Vectors
▪ A document vector is the average of its word vectors
• Machine Learning
▪ Document Classification (from document vectors)
▪ Document Clustering
• Recommendation
▪ Recommend similar documents to search query or sample documents
Can we do better than counts?
Binary/Counts (okay) → TFIDF (better) → Dimensionality Reduction → Word2Vec (best!)
• Answer: YES!
• Problems with counts:
▪ Limited information
– Possible Resolution: TFIDF
▪ Vectors HUGE for many documents
– Possible Resolution: Matrix Factorization
▪ Bag of Words → No Word Order
– Possible Resolution: Neural Networks → Word2Vec!
How to Evaluate Word Vectors?
• Answer: Comparability with human intuition
• Standard Baseline Tests:
▪ Analogies
▪ Ratings of Word Similarity
▪ Verb Tenses
▪ Country-Capital Relationships
Finding Better Word Vectors: Word2Vec
• Problem: Count vectors far too large for many documents.
▪ Solution: Word2Vec reduces the number of dimensions (configurable, e.g. 300)
• Problem: Bag of Words neglects word order.
▪ (Partial) Solution: Word2Vec trains on small sequences of text (“context windows”)
Training Word2Vec
• Use a Neural Network on Context Windows
• 2 main approaches for inputs and labels:
▪ Skip-Grams
▪ Continuous Bag of Words (CBOW)
The resulting vectors are usually similar, with subtle differences and differences in training time.
Training Word2Vec: Context Windows
• Input Layer: Context Windows
• Training observations for word2vec: all context windows in a corpus
• Context window size determines the size of the relevant window around each word:
▪ e.g., Document: "The quick brown fox jumped over the lazy dog."
▪ Window size: 4, target word "fox"
– (The original slide highlights four successive context windows within this sentence; the highlighting does not survive the text export.)
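A minimal sketch of how such windows could be generated, assuming the window size counts words on each side of the target (the slide's exact windowing convention is not fully recoverable from the text):

def context_windows(tokens, window_size):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        yield target, left + right

tokens = "The quick brown fox jumped over the lazy dog".split()
for target, context in context_windows(tokens, window_size=2):
    print(target, context)   # e.g. fox ['quick', 'brown', 'jumped', 'over']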
Training Word2Vec: One-Hot Word Vectors
• We need to be able to represent a sequence of words as a vector
• Assign each word an index from 0 to V-1
▪ V is the size of the vocabulary, i.e. the number of distinct words in the corpus
• A word vector is:
▪ 1 at the index of that word
▪ 0 for all other entries
• Called One-Hot Encoding
Training Word2Vec: One-Hot Context Windows
• Need vectors for context windows
• A window's vector is the concatenation of its word vectors
• For window size d, the vector is of length (V x d)
▪ Only d entries (one for each word) will be nonzero (1s)
• Example: if the window were “cat dog”:
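A minimal sketch of that encoding, assuming a toy three-word vocabulary (the slide's original figure is not included in this export):

import numpy as np

vocab = {'cat': 0, 'dog': 1, 'fish': 2}      # toy vocabulary, V = 3
V = len(vocab)

def one_hot(word):
    """One-hot vector: 1 at the word's index, 0 everywhere else."""
    vec = np.zeros(V)
    vec[vocab[word]] = 1
    return vec

window = ['cat', 'dog']                       # window size d = 2
window_vector = np.concatenate([one_hot(w) for w in window])
print(window_vector)                          # [1. 0. 0. 0. 1. 0.] -> length V x d = 6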
Training Word2Vec: SkipGrams
• SkipGrams is a neural network architecture that uses a word to predict the
words in the surrounding context, defined by the window size.
• Inputs:
▪ The middle word of the context window (one-hot encoded)
▪ Dimensionality: V
• Outputs:
▪ The other words of the context window (one-hot encoded)
▪ Dimensionality: (V x (d-1))
▪ Turn the crank!
Training Word2Vec: SkipGrams
• SkipGrams architecture:
Training Word2Vec: CBOW
• CBOW (continuous bag of words) uses the surrounding context (defined by
the window size) to predict the word.
• Inputs:
▪ The other words of the context window (one-hot encoded)
▪ Dimensionality: (V x (d-1))
• Outputs:
▪ The middle word of the context window (one-hot encoded)
▪ Dimensionality: V
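To make the two setups concrete, a rough sketch (not the actual training code) of the (input, output) pairs each architecture forms from a single window:

# Window of size d = 5 centered on "fox"
window = ['quick', 'brown', 'fox', 'jumped', 'over']
i = len(window) // 2
center = window[i]                           # 'fox'
context = window[:i] + window[i + 1:]        # the other d-1 words

# Skip-gram: the center word predicts each context word
skipgram_pairs = [(center, c) for c in context]
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumped'), ('fox', 'over')]

# CBOW: the context words together predict the center word
cbow_pair = (context, center)
# (['quick', 'brown', 'jumped', 'over'], 'fox')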
Training Word2Vec: CBOW
• CBOW architecture:
Training Word2Vec: Dimension Reduction
• Number of nodes in hidden layer, N, is a parameter
▪ It is the (reduced) dimensionality of our resulting word vector space!
▪ Fit neural net → find weights matrix W
▪ Word Vectors: x_N = W^T x
▪ Checking dimensions:
– x: V x 1
– W^T: N x V
– x_N: N x 1
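In gensim, all of this is wrapped in one call; a minimal sketch of training your own embeddings (parameter names follow gensim 4.x: `vector_size` is the hidden-layer size N, `window` the context size, `sg=1` selects Skip-Gram and `sg=0` CBOW; older gensim 3.x calls the first parameter `size`):

from gensim.models import Word2Vec

sentences = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog'],
             ['the', 'cat', 'sat', 'on', 'the', 'mat']]

model = Word2Vec(sentences, vector_size=50, window=4, min_count=1, sg=1)
print(model.wv['fox'].shape)   # (50,) -> one 50-dimensional vector per vocabulary word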
Training Word2Vec: What Happened?
• Learn which words are likely to appear near each word
• This context information ultimately leads to vectors for related words falling near one another!
• Which gives us really good word vectors, aka “Word Embeddings”
Do I need to Train Word2Vec?
• Answer: NO!
• You can download pre-trained Word2Vec models trained on
massive corpora of data.
• Common example: Google News Vectors, 300 dimensional vectors
for 3 million words, trained on Google News articles.
• File containing vectors (1.5 GB) can be downloaded for free and
easily loaded into gensim.
Nice Properties of Word2Vec Embeddings
• word2vec (somewhat magically!)
captures nice geometric relations
between words
▪ e.g.: Analogies
– King is to Queen as Man is to Woman
– The vector between King and Queen is (approximately) the same as that between Man and Woman!
– Works for all sorts of things: capitals, cities, etc.
Word2Vec with Gensim
Input:
from gensim.models import KeyedVectors

google_model = KeyedVectors.load_word2vec_format(google_vec_file, binary=True)
# woman - man + king
print(google_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))

Output:
[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431607246399)]
How can we Use Word2Vec?
• Vectors can be combined to create
features for documents
▪ e.g. Document Vector is average (or
sum) of its word vectors
• Use Document Vectors for ML on
Documents:
▪ Classification, Regression
▪ Clustering
▪ Recommendation
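A minimal sketch of turning word vectors into a document vector by averaging (using the `google_model` loaded earlier; out-of-vocabulary words are simply skipped):

import numpy as np

def document_vector(doc, model):
    """Average the vectors of the document's in-vocabulary words."""
    words = [w for w in doc.split() if w in model]
    return np.mean([model[w] for w in words], axis=0)

doc_vec = document_vector('the quick brown fox', google_model)
print(doc_vec.shape)   # (300,)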
Comparing Word2Vec Embeddings
• How to compare 2 word vectors?
• Cosine Similarity
▪ The cosine of the angle between the two vectors
▪ Vector length doesn’t matter
▪ Makes the most sense for word vectors
▪ Why?
– e.g. [2, 2, 2] and [4, 4, 4] should count as the same vector
– It’s the ratios of frequencies that define meaning
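A quick check of that intuition with a minimal cosine-similarity helper:

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([2, 2, 2]), np.array([4, 4, 4])))   # 1.0
print(cosine_similarity(np.array([1, 0, 0]), np.array([0, 1, 0])))   # 0.0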
Word Vector Application: Text Classification
• Problem: Categorizing News Articles
• Is the document about Politics? Sports? Science/Tech? etc.
• Approach:
▪ Word Vectors → Document Vectors
▪ Classification on Document Vectors
– Often KNN with Cosine Similarity
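A minimal sketch of that classification step with scikit-learn's KNN using cosine distance (the document vectors and labels here are random stand-ins; in practice they would come from the averaging step above):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_docs = np.random.rand(100, 300)                                # stand-in document vectors
y = np.random.choice(['politics', 'sports', 'tech'], size=100)   # stand-in topic labels

knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')
knn.fit(X_docs, y)
print(knn.predict(np.random.rand(3, 300)))                       # predicted topics for 3 new documents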
Word Vector Application: Text Clustering
• Problem: Grouping Similar Emails
• Work Emails, Bills, Ads, News, etc
• Approach:
▪ Word Vectors  Document Vectors
▪ Clustering on Document Vectors
– Use Cosine Similarity
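A minimal clustering sketch; scikit-learn's KMeans uses Euclidean distance, so normalizing document vectors to unit length first is a common workaround that makes it behave roughly like cosine similarity (document vectors are random stand-ins here):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

X_docs = np.random.rand(100, 300)            # stand-in document vectors
X_unit = normalize(X_docs)                   # unit length -> Euclidean distance ~ cosine
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X_unit)
print(labels[:10])                           # cluster assignment of the first 10 documents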
Word Vector Application: Recommendation
• Problem: Find me news stories I care
about!
• Approach:
▪ Word Vectors  Document Vectors
▪ Suggest documents similar to:
a) User search query
b) Example articles that the user has favorited
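A minimal recommendation sketch: embed the query the same way as the documents and rank by cosine similarity (all vectors here are random stand-ins):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

X_docs = np.random.rand(100, 300)                 # stand-in document vectors
query_vec = np.random.rand(1, 300)                # vector for the user's query or sample article
scores = cosine_similarity(query_vec, X_docs)[0]
top5 = np.argsort(scores)[::-1][:5]               # indices of the 5 most similar documents
print(top5)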
Summary
• With word vectors, so many possibilities!
• Can conceptually compare any bunch of words to any other bunch of words.
• Word2Vec finds really good, compact vectors.
▪ Trains a Neural Network
▪ On Context Windows
▪ SkipGram predicts the context words from the middle word in the window.
▪ CBOW predicts the middle word from the context words in the window.
• Word Vectors can be used for all sorts of ML
Word2Vec in Python
Word2Vec in Python - Loading Model
Input:
from gensim.models import KeyedVectors

google_model = KeyedVectors.load_word2vec_format(google_vec_file, binary=True)
print(type(google_model.vocab))                       # vocabulary dictionary (gensim 3.x; key_to_index in gensim 4)
print("{:,}".format(len(google_model.vocab.keys())))  # number of words
print(google_model.vector_size)                       # vector size

Output:
dict
3,000,000
300
Word2Vec in Python - Examining Vectors
Input:
bat_vector = google_model.word_vec('bat')
print(type(bat_vector))
print(len(bat_vector))
print(bat_vector.shape)
print(bat_vector[:5])

Output:
<class 'numpy.ndarray'>
300
(300,)
[-0.34570312 0.32421875 0.15722656 -0.04223633 -0.28710938]
Word2Vec in Python - Vector Similarity
Input:
print(google_model.similarity('Bill_Clinton', 'Barack_Obama'))
print(google_model.similarity('Bill_Clinton', 'Taylor_Swift'))

Output:
0.62116989722645277
0.25381746688228518
As expected, Bill Clinton is much more similar to Barack Obama than to Taylor Swift.
Word2Vec in Python - Most Similar Words
Input:
print(google_model.similar_by_word('Barack_Obama'))

Output:
[('Obama', 0.8036513328552246),
('Barrack_Obama', 0.7766816020011902),
('Illinois_senator', 0.757197916507721),
('McCain', 0.7530534863471985),
('Barack', 0.7448185086250305),
('Barack_Obama_D-Ill.', 0.7196038961410522),
('Hillary_Clinton', 0.6864978075027466),
('Sen._Hillary_Clinton', 0.6827855110168457),
('elect_Barack_Obama', 0.6812860369682312),
('Clinton', 0.6713168025016785)]
Word2Vec in Python - Analogies
Input:
print(google_model.most_similar(positive=['Paris', 'Spain'],
                                negative=['France'], topn=2))
print(google_model.most_similar(positive=['Yankees', 'Boston'],
                                negative=['New_York'], topn=2))

Output:
[('Madrid', 0.7571904063224792), ('Barcelona', 0.6230698823928833)]
[('Red_Sox', 0.8348262906074524), ('Boston_Red_Sox', 0.7118345499038696)]
Word2Vec in Python - Odd Word Out
Input:
print(google_model.doesnt_match(['breakfast', 'lunch', 'dinner', 'table']))
print(google_model.doesnt_match(['baseball', 'basketball', 'football', 'mattress']))

Output:
table
mattress
As expected, “table” and “mattress” are the odd words out.
Editor's notes

• Skip-Grams uses the current word to predict the surrounding window of context words; the skip-gram architecture weighs nearby context words more heavily than more distant ones. CBOW predicts the current word from a window of surrounding context words; the order of context words does not influence the prediction.
• Context window size determines how many words before and after a given word are included as that word's context words.