Data Mining Techniques
MUMTAZ KHAN
MS (SEMANTIC WEB)
TF-IDF
TF-IDF stands for Term Frequency–Inverse Document Frequency.
 Important for search: it figures out which terms are most relevant to a document.
Term frequency: It measures how often a word
occurs in a document.
◦ A word that occurs frequently is probably important to that
document’s meaning.
 Mathematical Form:
TF = (number of occurrences of the term in the document) / (total number of terms in the document)
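A minimal sketch of the TF formula above (an illustrative Python helper, not part of the original slides):

```python
def term_frequency(term, document_tokens):
    """Normalized TF: occurrences of `term` divided by the total tokens in the document."""
    return document_tokens.count(term) / len(document_tokens)

# "game" appears twice in the 10-word document d1 below, so TF = 2/10 = 0.2
d1 = "the game of life is a game of everlasting learning".split()
print(term_frequency("game", d1))
```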
TF-IDF (Continued)
IDF: Inverse Document Frequency measures the rarity of a term in the whole corpus.
◦ Let N denote the total number of documents and df the number of documents containing term t; then the inverse document frequency of t is defined as
IDF = log(N/df), or, as used in this example,
IDF = 1 + loge (total number of documents / number of documents containing the term), so
TF-IDF = TF * IDF
Practical Example of TF-IDF
◦ Suppose we have three documents d1, d2 and d3:
◦ d1= The game of life is a game of everlasting learning
◦ d2= The unexamined life is not worth living
◦ d3= Never stop learning
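A short sketch of the whole weighting scheme on these three documents, using the 1 + loge(N/df) form of IDF defined above (an illustrative reconstruction, not code from the slides):

```python
import math

docs = {
    "d1": "the game of life is a game of everlasting learning".split(),
    "d2": "the unexamined life is not worth living".split(),
    "d3": "never stop learning".split(),
}

def idf(term, docs):
    # IDF = 1 + loge(N / df), with N documents and df documents containing the term
    df = sum(1 for tokens in docs.values() if term in tokens)
    return 1 + math.log(len(docs) / df)

def tf_idf(term, doc_id, docs):
    tokens = docs[doc_id]
    tf = tokens.count(term) / len(tokens)   # normalized term frequency
    return tf * idf(term, docs)

print(idf("unexamined", docs))       # roughly 2.099, cf. the IDF table that follows
print(tf_idf("life", "d1", docs))    # roughly 0.141, cf. the query table that follows
```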
TF-IDF (Continued)
Steps for TF-IDF
Term frequencies (raw counts) for d1, d2 and d3:

d1:
Term/Word:  the  game  of  life  is  a  everlasting  learning
Frequency:   1    2    2    1    1   1       1          1

d2:
Term/Word:  the  unexamined  life  is  not  worth  living
Frequency:   1       1        1    1    1     1      1

d3:
Term/Word:  never  stop  learning
Frequency:    1     1       1

Normalized TF for d1, d2 and d3 is obtained by dividing each count by the document length. For d1:

Term/Word:      the      game     of       life     is       a        everlasting  learning
Normalized TF:  1/10=.1  2/10=.2  2/10=.2  1/10=.1  1/10=.1  1/10=.1  1/10=.1      1/10=.1
TF-IDF (Continued)
Normalized TF for d2:

Term/Word:      the        unexamined  life       is         not        worth      living
Normalized TF:  1/7=.1428  1/7=.1428   1/7=.1428  1/7=.1428  1/7=.1428  1/7=.1428  1/7=.1428

Normalized TF for d3:

Term/Word:      never      stop       learning
Normalized TF:  1/3=.3333  1/3=.3333  1/3=.3333

◦ Note: d1 contains 10 terms/words, d2 contains 7 terms/words and d3 contains 3 terms/words.
Calculation of IDF for each term/word:
IDF = 1 + loge (total number of documents / number of documents containing the term)
TF-IDF (Continued)
 Let us compute the IDF for the term "unexamined":
IDF = 1 + loge (total number of documents / number of documents containing "unexamined")
◦ There are 3 documents in all (d1, d2, d3), but the term "unexamined" appears only in document d2, so
IDF(unexamined) = 1 + loge (3/1) = 2.098726209, and similarly for the other terms:

Term         IDF
the          1.405507135
game         2.098726209
of           2.098726209
life         1.405507135
is           1.405507135
a            2.098726209
everlasting  2.098726209
learning     1.405507135
unexamined   2.098726209
not          2.098726209
worth        2.098726209
living       2.098726209
never        2.098726209
stop         2.098726209
TF-IDF (Continued)
 Let us calculate the TF-IDF weights and find the relevant documents for the query: life learning
 Note: For each term, TF-IDF is calculated by multiplying its normalized term frequency by its IDF, i.e.
TF-IDF = TF * IDF

Term       d1           d2           d3
life       0.140550715  0.200786736  0
learning   0.140550715  0            0.468502384
TF-IDF (Continued)
 Vector Space Model – Cosine Similarity
◦ For each document we derive a vector
◦ The set of documents in a collection is viewed as a set of vectors in a vector space
◦ Each term has its own axis
 Formula: to find the similarity between any two documents d1 and d2:
Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)
Dot product(d1, d2) = d1[0]*d2[0] + d1[1]*d2[1] + ... + d1[n]*d2[n]
||d1|| = sqrt(d1[0]^2 + d1[1]^2 + ... + d1[n]^2)
||d2|| = sqrt(d2[0]^2 + d2[1]^2 + ... + d2[n]^2)
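The formula above translates directly into a small helper (a hedged sketch; the function name is ours, not from the slides):

```python
import math

def cosine_similarity(v1, v2):
    # dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)
```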
TF-IDF (Continued)
 Vector Space Model – Cosine Similarity
TF-IDF (Continued)
 The TF-IDF for the query "life learning":

Term       TF   IDF          TF*IDF
life       .5   1.405507135  0.702753576
learning   .5   1.405507135  0.702753576

 Let us calculate the cosine similarity between the query and document d1:
Cosine Similarity(Query, Document1) = Dot product(Query, Document1) / (||Query|| * ||Document1||)
Dot product(Query, Document1)
= (0.702753576 * 0.140550715) + (0.702753576 * 0.140550715)
= 0.197545035151
||Query|| = sqrt((0.702753576)^2 + (0.702753576)^2) = 0.993843638185
||Document1|| = sqrt((0.140550715)^2 + (0.140550715)^2) = 0.198768727354
Cosine Similarity(Query, Document1) = 0.197545035151 / ((0.993843638185) * (0.198768727354))
= 0.197545035151 / 0.197545035151
= 1
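The same calculation for all three documents, reusing the cosine_similarity helper sketched above (TF-IDF values copied from the tables in the slides):

```python
# TF-IDF vectors over the query terms ("life", "learning"), taken from the tables above.
query = [0.702753576, 0.702753576]
d1 = [0.140550715, 0.140550715]
d2 = [0.200786736, 0.0]
d3 = [0.0, 0.468502384]

for name, doc in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, round(cosine_similarity(query, doc), 9))
# d1 -> 1.0, d2 -> ~0.707106781, d3 -> ~0.707106781, matching the score table on the next slide
```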
TF-IDF (Continued)
 The similarity scores between the query and d1, d2, d3 are:

                   d1  d2           d3
Cosine Similarity  1   0.707106781  0.707106781

 Some Demerits of TF-IDF
◦ It is based on the bag-of-words model
 therefore it does not capture position in text, semantics, co-occurrences in different documents, etc.
 For this reason, TF-IDF is only useful as a lexical-level feature
◦ Cannot capture semantics (e.g. as compared to topic models or word embeddings)
References
1. Van Rijsbergen, Cornelis J., Stephen Edward Robertson, and Martin F. Porter. New
models in probabilistic information retrieval. London: British Library Research and
Development Department, 1980
2. http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html
3. http://www.tfidf.com/
4. Wang, Wei, and Yongxin Tang. "Improvement and Application of TF-IDF Algorithm in
Text Orientation Analysis." (2016).
5. Wu, Ho Chung, et al. "Interpreting tf-idf term weights as making relevance
decisions." ACM Transactions on Information Systems (TOIS) 26.3 (2008): 13.
6. Sparck Jones, Karen. "A statistical interpretation of term specificity and its application in
retrieval." Journal of documentation 28.1 (1972): 11-21.
7. Salton, Gerard, and Christopher Buckley. "Term-weighting approaches in automatic text
retrieval." Information processing & management 24.5 (1988): 513-523.
LDA (Latent Dirichlet Allocation)
 Outline of LDA
◦ Introduction
◦ Model Definition
◦ Variational Inferences
◦ Example output and Simulation
◦ References
LDA (Continued)
 Introduction
◦ As more information becomes available, it becomes more difficult to find and discover what we need.
◦ We need tools to help us organize, search and understand these vast amounts of information.
◦ Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives.
LDA (Continued)
 Introduction (Goal of the Topic Model)
◦ A document has several topics
◦ Topics are associated with words
◦ Words are expressed through the topics into documents
Documents (observed) → Topics (latent) → Words (observed)
LDA (Continued)
 LDA
◦ LDA is a generative probabilistic model of a corpus
◦ LDA essentially makes pLSA a generative model by imposing a Dirichlet prior on the model parameters
◦ LDA is a Bayesian version of pLSA, and the parameters are now much more regularized
◦ LDA breaks down the collection of documents into topics
◦ It discovers the hidden themes in the collection
◦ It represents each document as a mixture of topics with their probability distribution
◦ Topics are represented as a mixture of words, with probabilities representing the importance of each word for the topic
LDA (Continued)
 Twitter topic discovery using LDA
◦ Fetch tweet data using the "twitteR" package
◦ Load the data into the R environment
◦ Clean the data to remove: re-tweet information, links, special characters, emoticons, and frequent words like is, as, this, etc.
◦ Create a Term Document Matrix (TDM) using the "tm" package
◦ Calculate TF-IDF (Term Frequency-Inverse Document Frequency) for all the words in the TDM
◦ Exclude all the words with TF-IDF <= 0.1, to remove the words that are less frequent
◦ Calculate the optimal number of topics (k) in the corpus using a log-likelihood function on the TDM
◦ Apply LDA using the "topicmodels" package to discover topics (a Python equivalent is sketched after this list)
◦ Evaluate the model
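The workflow above is described for R; as a rough illustration only, a comparable pipeline in Python using gensim might look like the sketch below (the toy `texts` stand in for cleaned tweets and are not from the slides):

```python
from gensim import corpora, models

# Pretend these are tweets that have already been cleaned and tokenized.
texts = [
    ["latent", "dirichlet", "allocation", "topics"],
    ["topic", "modeling", "discovers", "hidden", "themes"],
    ["never", "stop", "learning"],
]

dictionary = corpora.Dictionary(texts)              # vocabulary
corpus = [dictionary.doc2bow(t) for t in texts]     # bag-of-words representation per document

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])          # top words per discovered topic
```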
LDA (Continued)
 LDA – generative process
◦ Setting up a generative model
 We have D documents using a vocabulary of V word types
 Each document contains (up to) N word tokens
 We assume K topics
 Each document has a K-dimensional multinomial θd over topics, with a common Dirichlet prior Dir(α)
 Each topic has a V-dimensional multinomial βk over words, with a common symmetric Dirichlet prior Dir(η)
LDA (Continued)
 The Generative process
◦ For each topic k = 1…K:
 Draw a word distribution (multinomial) βk ∼ Dir(η)
◦ For each document d = 1…D:
 Draw a topic distribution (multinomial) θd ∼ Dir(α)
 For each word Wd,n:
◦ Draw a topic Zd,n ∼ Mult(θd), with Zd,n ∈ [1…K]
◦ Draw a word Wd,n ∼ Mult(βZd,n)
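A toy simulation of this generative process (the parameter values here are ours, chosen only for illustration):

```python
import numpy as np

K, V, D, N = 3, 8, 2, 5          # topics, vocabulary size, documents, words per document
alpha, eta = 0.5, 0.1
rng = np.random.default_rng(0)

beta = rng.dirichlet([eta] * V, size=K)      # one word distribution per topic: beta_k ~ Dir(eta)
for d in range(D):
    theta = rng.dirichlet([alpha] * K)       # topic proportions for document d: theta_d ~ Dir(alpha)
    for n in range(N):
        z = rng.choice(K, p=theta)           # topic assignment: z_{d,n} ~ Mult(theta_d)
        w = rng.choice(V, p=beta[z])         # word: w_{d,n} ~ Mult(beta_{z_{d,n}})
        print(f"doc {d}, position {n}: topic {z}, word id {w}")
```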
LDA (Continued)
 Graphical Model of LDA
LDA (Continued)
 LDA Joint Distribution
LDA (Continued)
 The LDA joint distribution defines a posterior p(θ, z, β | w)
◦ From a collection of documents we have to infer:
 Per-word topic assignments zd,n
 Per-document topic proportions θd
 Per-corpus topic distributions βk
LDA (Continued)
 Why does the posterior depend on α and η?
LDA (Continued)
 LDA Graphical Model with working procedure
LDA (Continued)
 LDA Graphical Model with working procedure (continued)
LDA (Continued)
 LDA inputs:
◦ The set of words per document, for each document in the corpus
 LDA outputs:
◦ Corpus-wide topic vocabulary distributions
◦ Topic assignments per word
◦ Topic proportions per document
LDA (Continued)
 Topic models: probabilistic generative process vs. statistical inference
◦ 3 latent variables:
 Word distribution per topic (word-topic matrix)
 Topic distribution per document (topic-document matrix)
 Topic assignment per word
LDA (Continued)
 The Dirichlet distribution is the conjugate prior of the multinomial distribution
 The parameter α controls the mean shape and sparsity of θ
◦ high α = near-uniform θ, small α = sparse θ (see the sketch below)
 In LDA the topics are drawn from a V-dimensional Dirichlet and the topic proportions from a K-dimensional Dirichlet
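A quick numerical illustration of that α behaviour (toy values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
print(rng.dirichlet([10.0] * 5))   # high alpha: proportions close to uniform over 5 topics
print(rng.dirichlet([0.1] * 5))    # small alpha: most of the mass lands on one or two topics
```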
LDA (Continued)
 The Geometric intuition (Simplex)
LDA (Continued)
 The Dirichlet is a "dice factory"
◦ Multivariate equivalent of the Beta distribution (a "coin factory")
◦ The parameters α determine the form of the prior
 The Dirichlet is defined over the (K-1)-simplex
◦ Its K non-negative arguments sum to one
LDA (Continued)
 To which topics does a given document belong? We thus want to compute the posterior distribution of the hidden variables given a document.
LDA (Continued)
◦ Variational Inference
LDA (Continued)
 LDA Summary
◦ LDA Can:
 Visualize the hidden thematic structure in large corpora
 Generalize new data to fit into that structure
 Used for Feature reduction, bioinformatics
 Used for Sentiment analysis, object localization, automatic harmonic analysis for
music
◦ Note: LDA Main Goal
 In each document, allocate its words to few topics
 In each topic, assign high probability to few terms
◦ This follows from the joint distribution:
 Sparse topic proportions come from the first term
 Sparse topics come from the second term
◦ Limitations:
 Must know the number of topics k in advance
 Dirichlet topic distribution cannot capture correlations among topics
References
1. Jelodar, Hamed, et al. "Latent Dirichlet Allocation (LDA) and Topic modeling: models,
applications, a survey." arXiv preprint arXiv:1711.04305 (2017).
2. http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
3. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." Journal of
machine Learning research3.Jan (2003): 993-1022.
4. Video Lectures of David Blei on videolectures.net: http://videolectures.net/mlss09uk_blei_tm/
5. Campr, Michal, and Karel Ježek. "Comparing semantic models for evaluating automatic
document summarization." International Conference on Text, Speech, and Dialogue. Springer
International Publishing, 2015.
6. Hu, Diane J. "Latent dirichlet allocation for text, images, and music." University of California, San
Diego. Retrieved April 26 (2009): 2013.
7. Jayapal, Arun, and Martin Emms. "Topic Models-Latent Dirichlet Allocation." (2014).
8. Wang, Y. Distributed gibbs sampling of latent topic models: The gritty details. Technical report,
2008.
9. https://cs.stanford.edu/~ppasupat/a9online/1140.html
Latent Semantic Indexing (LSI)
Problems in Lexical matching
Motivation
Introduction
 How LSI Work?
LSI Procedure
SVD
Example
Application
Demerits
LSI (Continued)
Problems in Lexical Matching
◦ Synonymy
 - widespread synonym occurrences
 - decreases recall
◦ Polysemy
 - retrieval of irrelevant documents
 - poor precision
◦ Noise
 - Boolean search on specific words
 - retrieval of unrelated documents
LSI (Continued)
Motivation for LSI
◦ To find and fit a useful model of the relationships between terms and documents
◦ To find out what terms are "really" implied by a query
◦ LSI allows the user to search for concepts rather than specific words
◦ Terms and documents are stored in the concept space
◦ LSI can retrieve documents related to a user's query even when the query and the documents do not share any common terms
◦ Mathematical model
 Relates documents and concepts
◦ LSI tries to overcome the problems of lexical matching
LSI (Continued)
Introduction
◦ LSI is a technique that projects queries and documents
into a space with “latent” semantic dimensions
◦ It uses multidimensional vector space to place all
documents and terms
◦ Each dimension in that space corresponds to a concept
existing in the collection.
◦ Common related terms in a document and query will pull
document and query vector close to each other.
LSI (Continued)
Concepts in Documents
LSI (Continued)
How does LSI work?
• Given a set of documents:
 how do we determine the similar ones?
 examine the documents
 try to find concepts in common
 classify the documents
• This is how LSI also works.
• LSI represents terms and documents in a high-dimensional space, allowing relationships between terms and documents to be exploited during searching.
• Convert the high-dimensional space to a lower-dimensional space, throw out the noise, and keep the good stuff.
LSI (Continued)
LSI Procedure
◦ Obtain the term-document matrix.
◦ Compute the SVD.
◦ Truncate the SVD into a reduced-k LSI space:
 - k-dimensional semantic structure
 - similarity in the reduced space:
   - term-term
   - term-document
   - document-document
 Query Procedure
◦ Map the query to the reduced k-space:
 q' = q^T Uk Sk^-1
◦ Retrieve documents or terms within a proximity:
 - cosine
 - best m
LSI (Continued)
Singular Value Decomposition (SVD)
◦ LSI uses SVD, a linear algebra method:
◦ SVD decomposes the original matrix into three matrices:
 Document eigenvector matrix (V)
 Singular value (diagonal) matrix (Σ)
 Term eigenvector matrix (U)
◦ The SVD of a rectangular matrix A is given by:
 A = U Σ V^T
LSI (Continued)
Singular Value Decomposition (SVD)
◦ For an m × n matrix A of rank r there exists a factorization (the Singular Value Decomposition, SVD) as follows:
 A = U Σ V^T
◦ The columns of U are orthogonal eigenvectors of AA^T
◦ The columns of V are orthogonal eigenvectors of A^T A
◦ The eigenvalues λ1 … λr of AA^T are also the eigenvalues of A^T A, and the singular values on the diagonal of Σ are their square roots
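A quick numerical check of the decomposition A = U Σ V^T (the matrix here is arbitrary, chosen only to illustrate the factorization):

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))                   # True: A is recovered from U, Sigma, V^T
print(np.allclose(s**2, np.linalg.eigvalsh(A.T @ A)[::-1]))  # squared singular values = eigenvalues of A^T A
```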
LSI (Continued)
Example
◦ Suppose we have three documents:
 d1: Shipment of gold damaged in a fire
 d2: Delivery of silver arrived in a silver truck.
 d3: Shipment of gold arrived in a truck.
◦ Problem: Use Latent Semantic Indexing (LSI) to rank these documents
for the query gold silver truck.
Step 1: Set term weights and construct the term-document matrix A and query matrix:
LSI (Continued)
Step 2: Decompose matrix A and find the U, S and V matrices.
LSI (Continued)
 Step 3: Implement a Rank 2 Approximation by keeping the first two columns of U and V
and the first two columns and rows of S.
LSI (Continued)
 Step 4: Find the new document vector coordinates in this reduced 2-dimensional space.
 The rows of V hold the eigenvector values; these are the coordinates of the individual document vectors, hence
 d1(-0.4945, 0.6492)
 d2(-0.6458, -0.7194)
 d3(-0.5817, 0.2469)
 Step 5: Find the new query vector coordinates in the reduced 2-dimensional space.
 Note: These are the new coordinates of the query vector in two dimensions. Note how this matrix now differs from the original query matrix q given in Step 1.
LSI (Continued)
 Step 6: Rank documents in decreasing order of query-document cosine similarities.
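Steps 1-6 can be reproduced end to end; the sketch below assumes raw term counts as term weights, so intermediate coordinates may differ in sign or scale from the slides, but the final ranking (d2 first) should match:

```python
import numpy as np

terms = ["a", "arrived", "damaged", "delivery", "fire", "gold",
         "in", "of", "shipment", "silver", "truck"]
docs = ["shipment of gold damaged in a fire",
        "delivery of silver arrived in a silver truck",
        "shipment of gold arrived in a truck"]

A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)     # term-document matrix
q = np.array(["gold silver truck".split().count(t) for t in terms], dtype=float)   # query vector

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T      # rank-2 approximation
qk = q @ Uk @ np.linalg.inv(Sk)                          # q' = q^T Uk Sk^-1

for name, dk in zip(["d1", "d2", "d3"], Vk):
    cos = dk @ qk / (np.linalg.norm(dk) * np.linalg.norm(qk))
    print(name, round(cos, 4))                           # d2 ranks highest for "gold silver truck"
```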
LSI (Continued)
 Graphical Representation
 We can see that document d2 scores higher than d3 and d1. Its vector is
closer to the query vector than the other vectors. Also note that Term
Vector Theory is still used at the beginning and at the end of LSI
LSI (Continued)
Applications of LSI
◦ Information Retrieval
◦ Information Filtering
◦ Relevance Feedback
◦ Improving performance of Search Engines
 in ranking pages
◦ Cross-language retrieval
◦ Automated essay grading
◦ Optimizing link profile of your web page
◦ Modelling of human cognitive function
◦ Dynamic advertisements put on pages, Google’s
AdSense
LSI (Continued)
Demerits of LSI
◦ Storage
◦ The complexity of the LSI model obtained from the truncated SVD is costly
◦ Its execution efficiency lags far behind that of the simpler Boolean models, especially on large data sets
◦ The latent topic dimension cannot be chosen arbitrarily; it is bounded by the rank of the matrix
◦ Bad for millions of words or documents
◦ Hard to incorporate new words or documents
References
◦ http://www.bluebit.gr/matrix-calculator/
◦ Rosario, Barbara. "Latent semantic indexing: An overview." Techn. rep. INFOSYS 240 (2000): 1-16.
◦ Ding, Chris HQ. "A probabilistic model for latent semantic indexing." Journal of the Association for
Information Science and Technology 56.6 (2005): 597-608.
◦ Dumais, Susan T. "Latent semantic indexing (LSI) and TREC-2." Nist Special Publication Sp (1994):
105-105.
◦ Alter, Orly, Patrick O. Brown, and David Botstein. "Singular value decomposition for genome-
wide expression data processing and modeling." Proceedings of the National Academy of
Sciences 97.18 (2000): 10101-10106.
◦ Golub, Gene H., and Charles F. Van Loan. Matrix computations. Vol. 3. JHU Press, 2012.
◦ http://www-db.deis.unibo.it/courses/SI-M/
◦ http://web.eecs.utk.edu/research/lsi/
◦ http://lsi.research.telcordia.com/
Word2Vec
◦ It is used to generate representation vectors for words
◦ Maps words to continuous vector representations
 i.e. points in an N-dimensional space
◦ Learns vectors from training data (generalizations)
◦ It is a numeric representation for each word
 that enables capturing relationships between words, such as synonyms and analogies
Word2Vec (Continued)
Continuous Bag of Words (CBOW)
◦ It predicts the missing word from a window of context words
 Suppose we are given the words "Latent Dirichlet"; the CBOW model then predicts the missing word "Allocation", giving "Latent Dirichlet Allocation"
◦ It is useful for identifying the missing word in a sentence
◦ It can identify effective sentiment orientations
◦ Randomly initialize input/output weight matrices of sizes VxN and NxV, where V is the vocabulary size and N is the vector size (a parameter)
◦ Update the weight matrices using SGD, backpropagation and cross-entropy over the corpus
◦ The hidden layer size corresponds to the word vector dimensionality.
Word2Vec (Continued)
Skip-Gram
◦ The method is very similar, except now we predict a window of words given a single word vector
◦ It predicts the context words given a word
 Suppose we are given the word "Dirichlet"; the Skip-Gram model then predicts the context words, giving "Latent Dirichlet Allocation".
◦ Boils down to maximizing the dot-product similarity of context words and the target word
◦ Skip-gram typically outperforms CBOW on semantic and syntactic accuracy (Mikolov et al.)
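Both architectures are available in gensim; a hedged sketch (gensim 4.x API, toy sentences of ours) where sg=0 selects CBOW and sg=1 selects skip-gram:

```python
from gensim.models import Word2Vec

sentences = [["latent", "dirichlet", "allocation"],
             ["latent", "semantic", "indexing"],
             ["term", "frequency", "inverse", "document", "frequency"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # skip-gram
print(skipgram.wv.most_similar("dirichlet", topn=2))   # neighbours are meaningless on a toy corpus
```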
Word2Vec (Continued)
Demerits
• Quality depends on input data, number of samples, and
size of vectors (possibly long computation time!)
 But Google released 3 million word vectors trained on 100 billion words!
• Averaging vec’s does not work well (in my experience) on
large text (> tweet level)
• W2V cannot provide fixed-length feature vectors for
variable-length text (pretty much everything!)
Doc2Vec
◦ It generalizes Word2Vec to whole documents (phrases, sentences, etc.)
◦ Provides a fixed-length vector
◦ Learns via Distributed Memory (DM) and Distributed Bag of Words (DBOW)
Doc2Vec (Continued)
Distributed Memory (DM)
◦ Assign and randomly initialize a paragraph vector for each document
◦ Predict the next word using the context words + the paragraph vector
◦ Slide the context window across the document but keep the paragraph vector fixed (hence "distributed memory")
◦ Updating is done via SGD and backpropagation.
Doc2Vec (Continued)
Distributed Bag of Words (DBOW)
◦ Only use paragraph vector (no word vector!)
◦ Take window of words in paragraph and randomly
sample which one to predict using paragraph vector
(ignores word ordering )
◦ Simpler, more memory efficient
◦ DM typically outperforms DBOW
 but DM+DBOW is even better!
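Both variants are available in gensim; a hedged sketch (gensim 4.x API, toy documents of ours), where dm=1 gives DM and dm=0 gives DBOW:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["never", "stop", "learning"], tags=["d3"]),
        TaggedDocument(words=["the", "unexamined", "life", "is", "not", "worth", "living"], tags=["d2"])]

dm_model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=40)    # Distributed Memory
dbow_model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=0, epochs=40)  # Distributed Bag of Words
print(dm_model.dv["d3"][:5])   # a fixed-length vector for document d3
```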
Doc2Vec Example
Doc2Vec on Wikipedia [1]
[1] Document Embedding with Paragraph Vectors, A. Dai et al., 2014
Doc2Vec Example (Continued)
Doc2Vec on Wikipedia [1]
◦ LDA vs. Doc2Vec nearest neighbours to "Machine learning" (bold = unrelated to machine learning)
[1] Document Embedding with Paragraph Vectors, A. Dai et al., 2014
Doc2Vec Example (Continued)
Doc2Vec on Wikipedia [1]
[1] Document Embedding with Paragraph Vectors, A. Dai et al., 2014
Word2Vec (Continued)
Applications
◦ Information retrieval
◦ Document classification
◦ Recommendation algorithms
◦ etc.
Thank you!
Doc2Vec (Continued)
Conclusion
◦ Doc2Vec is more efficient and robust than other methods such as LSI, LDA and TF-IDF
Thank you!