SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Downloaden Sie, um offline zu lesen
TEXT MINING
Dimensionality Reduction for Text
• Text-based queries can be represented as vectors, which can
be used to search for their nearest neighbors in a document
collection.
• For any nontrivial document database, the number of terms T
and the number of documents D are usually quite large.
• Such high dimensionality leads to the problem of inefficient
computation, since the resulting frequency table will have size
T×D.
• The high dimensionality also leads to very sparse vectors and
increases the difficulty in detecting and exploiting the
relationships among terms.
05/07/15 2TEXT MINING
• Matrix and vector notations
• In the following part, we use
x1,…,xtnЄ Rm
to represent the n documents with m features.
• They can be represented as a term-document matrix
X = [x1,x2….,xn].
05/07/15 3TEXT MINING
Latent Semantic Indexing
• It is one of the most popular algorithms for document
dimensionality reduction.
• It is fundamentally based on SVD (singular value
decomposition).
• The rank of the term-document X is r, then LSI decomposes X
using SVD as follows:
X =U∑VT
 where ∑ = diag(σ1,…. σr) and σ1≥ σ2 ≥.. ≥ σr are the singular
values of X,
 U =[a1,…,ar] and aiis called the left singular vector,
 V = [v1,…,vr], and vi is called the right singular vector.
05/07/15 4TEXT MINING
• LSI uses the first k vectors in U as the transformation matrix
to embed the original documents into a k-dimensional
subspace.
• The basic idea of LSI is to extract the most representative
features, and at the same time the reconstruction error can be
minimized.
05/07/15 5TEXT MINING
• Let a be the transformation vector. The objective function of
LSI can be stated as follows:
aopt = arg mina ||X−aaT
X||2
= arg maxa aT
XXT
a,
• with the constraint,
aT
a = 1.
05/07/15 6TEXT MINING
Locality Preserving Indexing
• It aims to extract the most discriminative features.
• The basic idea of LPI is to preserve the locality information.
• If two documents are near each other in the original document
space, LPI tries to keep these two documents close together in
the reduced dimensionality space.
• Since the neighboring documents probably relate to the same
topic, LPI is able to map the documents related to the same
semantics as close to each other as possible.
05/07/15 7TEXT MINING
• Given the document set x1,…,xnЄRm
, LPI constructs a similarity
matrix S Є Rn×n
.
• The transformation vectors of LPI can be obtained by solving
the following minimization problem:
• aopt = arg mina ∑i,j (aT
xi−aT
xj)2
Sij
= arg minaaT
XLXT
a.
• with the constraint, aT
XDXT
a = 1,
• where L = D−S is the Graph Laplacian.
05/07/15 8TEXT MINING
• LPI constructs the similarity matrix S as
Sij=xt
ixj/||xt
ixj||
• if xi is among the p nearest neighbors of xj or xj is among the p
nearest neighbors of xi
• Sij=0; otherwise.
• Thus, the objective function in LPI incurs a heavy penalty if
neighboring points xi and xj are mapped far apart.
05/07/15 9TEXT MINING
• LPI aims to discover the local geometrical structure of the
document space.
• Since the neighboring documents probably relate to the same
topic, LPI can have more discriminating power than LSI.
• Therefore, for document clustering and document
classification, we might expect LPI to have better performance
than LSI.
05/07/15 10TEXT MINING
Probabilistic Latent Semantic Indexing
• It achieves dimensionality reduction through a probabilistic mixture
model.
• Assume there are k latent common themes in the document
collection, and each is characterized by a multinomial word
distribution.
• A document is regarded as a sample of a mixture model with these
theme models as components.
• We fit such a mixture model to all the documents, and the obtained
k component multinomial models can be regarded as defining k new
semantic dimensions.
• The mixing weights of a document can be used as a new
representation of the document in the low latent semantic
dimensions.
05/07/15 11TEXT MINING
• Formally, let C = {d1,d2,..,dn} be a collection of n documents.
• Let θ1,.., θk be k theme multinomial distributions.
• A word w in document di is regarded as a sample of the
following mixture model.
pdi (w) =∑j=1
k
[∏di,j p(wθj)]
• where ∏ di,j is a document-specific mixing weight for the j-th
aspect theme, and ∑j=1
k
, ∏di,j=1.
05/07/15 12TEXT MINING
• The log-likelihood of the collection C is,
log p(C ۸) =∑i=1
n
∑wЄV[c(w,di)log(∑j=1
k
(∏di,jp(wθj)))],
• where V is the set of all the words,
• c(w,di) is the count of word w in document di,
• And ۸=({θj;{∏di,j}n
i=1}k
j=1) is the set of all the theme model
parameters.
05/07/15 13TEXT MINING
Text Mining Approaches
• The major approaches, based on the kinds of data they take as
input, are
(1)The keyword-based approach, where the input is a set of
keywords or terms in the documents,
(2)The tagging approach, where the input is a set of tags, and
(3)The information-extraction approach, which inputs
semantic information, such as events, facts, or entities
uncovered by information extraction.
05/07/15 14TEXT MINING
• A simple keyword-based approach may only discover
relationships at a relatively shallow level.
• It may not bring much deep understanding to the text.
• The tagging approach may rely on tags obtained by manual
tagging or by some automated categorization algorithm.
• The information-extraction approach is more advanced and
may lead to the discovery of some deep knowledge, but it
requires semantic analysis of text by natural language
understanding and machine learning methods.
05/07/15 15TEXT MINING
Keyword-Based Association Analysis
• It collects sets of keywords or terms that occur frequently
together and then finds the association or correlation
relationships among them.
• It first preprocesses the text data by parsing, stemming,
removing stop words, and so on, and then evokes association
mining algorithms.
• In a document database, each document can be viewed as a
transaction, while a set of keywords in the document can be
considered as a set of items.
• That is, the database is in the format,
{document_id,a_set_of_ keywords}.
05/07/15 16TEXT MINING
• Notice that a set of frequently occurring consecutive or closely
located keywords may form a term or a phrase.
• It can help detect compound associations, domain-dependent
terms or phrases, such as [Stanford, University] or [U.S.,
President, George W. Bush], or noncompound associations,
such as [dollars, shares, exchange, total, commission, stake,
securities].
05/07/15 17TEXT MINING
• Mining based on these associations is referred to as “term-
level association mining”.
• 2 advantages:
(1)Terms and phrases are automatically tagged so there is no
need for human effort in tagging documents; and
(2)The number of meaningless results is greatly reduced, as is
the execution time of the mining algorithms.
05/07/15 18TEXT MINING
Document Classification Analysis
• It deals with the existence of a tremendous number of on-line
documents, it is tedious yet essential to be able to automatically
organize such documents into classes.
• Document classification has been used in
(1)Automated topic tagging ,
(2)Topic directory construction,
(3)Identification of the document writing styles
(4)Classifying the purposes of hyperlinks.
• Procedure:
 First, a set of preclassified documents is taken as the training set.
 The training set is then analyzed in order to derive a classification
scheme.
 The so-derived classification scheme can be used for classification
of other on-line documents.
05/07/15 19TEXT MINING
• This process appears similar to the classification of relational
data.
• For example, in the tuple fsunny; warm; dry; not windy; play
tennisg, the value “sunny” corresponds to the attribute
weather outlook, “warm” corresponds to the attribute
temperature, and so on.
• The classification analysis decides which set of attribute-value
pairs has the greatest discriminating power in determining
whether a person is going to play tennis.
05/07/15 20TEXT MINING
• On the other hand, document databases are not structured
according to attribute-value pairs.
• That is, a set of keywords associated with a set of documents
is not organized into a fixed set of attributes or dimensions.
• Therefore, commonly used relational data-oriented
classification methods, may not be effective.
• A feature selection process can be used to remove terms in the
training documents that are statistically uncorrelated with the
class labels. This will reduce the set of terms to be used in
classification, thus improving both efficiency and accuracy.
05/07/15 21TEXT MINING
• An association-based classification, classifies documents based on a
set of associated, frequently occurring text patterns.
• Notice that very frequent terms are likely poor discriminators.
• Thus only those terms that are not very frequent and that have good
discriminative power will be used in document classification.
• First, keywords and terms can be extracted.
• Second, concept hierarchies of keywords and terms can be obtained
using available term classes such as WordNet, or relying on expert
knowledge, or some keyword classification systems.
05/07/15 22TEXT MINING
Document Clustering Analysis
• It organizing documents in an unsupervised manner.
• Due to the curse of dimensionality, it makes sense to first
project the documents into a lowerdimensional subspace.
• Then, the traditional clustering algorithms can be applied.
• The mixture model clustering method models the text data
with a mixture model, often Involving multinomial component
models.
05/07/15 23TEXT MINING
• Clustering involves two steps:
(1) Estimating the model parameters based on the text data and
any additional prior knowledge, and
(2) Inferring the clusters based on the estimated model
parameters.
• ProbabilisticLatentSemanticAnalysis(PLSA)
• LatentDirichletAllocation(LDA)
• The clusters can be designed to facilitate comparative analysis
of documents.
05/07/15 24TEXT MINING

Weitere ähnliche Inhalte

Was ist angesagt?

Types of Machine Learning
Types of Machine LearningTypes of Machine Learning
Types of Machine LearningSamra Shahzadi
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and RegressionMegha Sharma
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text MiningMinha Hwang
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data miningDataminingTools Inc
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval ssilambu111
 
Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningKuppusamy P
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsMd. Main Uddin Rony
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationMarina Santini
 
Information retrieval concept, practice and challenge
Information retrieval   concept, practice and challengeInformation retrieval   concept, practice and challenge
Information retrieval concept, practice and challengeGan Keng Hoon
 

Was ist angesagt? (20)

Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
 
Types of Machine Learning
Types of Machine LearningTypes of Machine Learning
Types of Machine Learning
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Deep learning presentation
Deep learning presentationDeep learning presentation
Deep learning presentation
 
Text Mining
Text MiningText Mining
Text Mining
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
Automatic indexing
Automatic indexingAutomatic indexing
Automatic indexing
 
Machine learning
Machine learningMachine learning
Machine learning
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
 
Web mining
Web miningWeb mining
Web mining
 
web mining
web miningweb mining
web mining
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine Learning
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
 
Machine Learning and Data Mining
Machine Learning and Data MiningMachine Learning and Data Mining
Machine Learning and Data Mining
 
Information retrieval concept, practice and challenge
Information retrieval   concept, practice and challengeInformation retrieval   concept, practice and challenge
Information retrieval concept, practice and challenge
 
Decision tree
Decision treeDecision tree
Decision tree
 

Ähnlich wie 4.4 text mining

A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationNinad Samel
 
Text clustering
Text clusteringText clustering
Text clusteringKU Leuven
 
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...ijdmtaiir
 
A systematic study of text mining techniques
A systematic study of text mining techniquesA systematic study of text mining techniques
A systematic study of text mining techniquesijnlc
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
 
DisMath-lecture-1-Introduction-to-Discrete-Maths-08032022-114934am.pptx
DisMath-lecture-1-Introduction-to-Discrete-Maths-08032022-114934am.pptxDisMath-lecture-1-Introduction-to-Discrete-Maths-08032022-114934am.pptx
DisMath-lecture-1-Introduction-to-Discrete-Maths-08032022-114934am.pptxAdeel Saifee
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET Journal
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyondErnesto Reig
 
Text Segmentation for Online Subjective Examination using Machine Learning
Text Segmentation for Online Subjective Examination using Machine   LearningText Segmentation for Online Subjective Examination using Machine   Learning
Text Segmentation for Online Subjective Examination using Machine LearningIRJET Journal
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
 
Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...Infrrd
 
A Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed ClusteringA Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed ClusteringIRJET Journal
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingNimrita Koul
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text miningIRJET Journal
 
2014 IEEE DOTNET DATA MINING PROJECT Similarity preserving snippet based visu...
2014 IEEE DOTNET DATA MINING PROJECT Similarity preserving snippet based visu...2014 IEEE DOTNET DATA MINING PROJECT Similarity preserving snippet based visu...
2014 IEEE DOTNET DATA MINING PROJECT Similarity preserving snippet based visu...IEEEMEMTECHSTUDENTSPROJECTS
 
IEEE 2014 DOTNET DATA MINING PROJECTS Similarity preserving snippet based vis...
IEEE 2014 DOTNET DATA MINING PROJECTS Similarity preserving snippet based vis...IEEE 2014 DOTNET DATA MINING PROJECTS Similarity preserving snippet based vis...
IEEE 2014 DOTNET DATA MINING PROJECTS Similarity preserving snippet based vis...IEEEMEMTECHSTUDENTPROJECTS
 

Ähnlich wie 4.4 text mining (20)

A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorization
 
CLUSTER ANALYSIS.pptx
CLUSTER ANALYSIS.pptxCLUSTER ANALYSIS.pptx
CLUSTER ANALYSIS.pptx
 
Text clustering
Text clusteringText clustering
Text clustering
 
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
 
A systematic study of text mining techniques
A systematic study of text mining techniquesA systematic study of text mining techniques
A systematic study of text mining techniques
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
 
DisMath-lecture-1-Introduction-to-Discrete-Maths-08032022-114934am.pptx
DisMath-lecture-1-Introduction-to-Discrete-Maths-08032022-114934am.pptxDisMath-lecture-1-Introduction-to-Discrete-Maths-08032022-114934am.pptx
DisMath-lecture-1-Introduction-to-Discrete-Maths-08032022-114934am.pptx
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
E1062530
E1062530E1062530
E1062530
 
Incremental clustering in search engines
Incremental clustering in search enginesIncremental clustering in search engines
Incremental clustering in search engines
 
Text Segmentation for Online Subjective Examination using Machine Learning
Text Segmentation for Online Subjective Examination using Machine   LearningText Segmentation for Online Subjective Examination using Machine   Learning
Text Segmentation for Online Subjective Examination using Machine Learning
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
Bl24409420
Bl24409420Bl24409420
Bl24409420
 
Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...
 
A Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed ClusteringA Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed Clustering
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text mining
 
2014 IEEE DOTNET DATA MINING PROJECT Similarity preserving snippet based visu...
2014 IEEE DOTNET DATA MINING PROJECT Similarity preserving snippet based visu...2014 IEEE DOTNET DATA MINING PROJECT Similarity preserving snippet based visu...
2014 IEEE DOTNET DATA MINING PROJECT Similarity preserving snippet based visu...
 
IEEE 2014 DOTNET DATA MINING PROJECTS Similarity preserving snippet based vis...
IEEE 2014 DOTNET DATA MINING PROJECTS Similarity preserving snippet based vis...IEEE 2014 DOTNET DATA MINING PROJECTS Similarity preserving snippet based vis...
IEEE 2014 DOTNET DATA MINING PROJECTS Similarity preserving snippet based vis...
 

Mehr von Krish_ver2

5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back trackingKrish_ver2
 
5.5 back track
5.5 back track5.5 back track
5.5 back trackKrish_ver2
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02Krish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructuresKrish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructuresKrish_ver2
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithmKrish_ver2
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03Krish_ver2
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programmingKrish_ver2
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-iKrish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03Krish_ver2
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquerKrish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03Krish_ver2
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02Krish_ver2
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing extKrish_ver2
 
4.4 external hashing
4.4 external hashing4.4 external hashing
4.4 external hashingKrish_ver2
 

Mehr von Krish_ver2 (20)

5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back tracking
 
5.5 back track
5.5 back track5.5 back track
5.5 back track
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithm
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programming
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02
 
5.1 greedy
5.1 greedy5.1 greedy
5.1 greedy
 
5.1 greedy 03
5.1 greedy 035.1 greedy 03
5.1 greedy 03
 
4.4 hashing02
4.4 hashing024.4 hashing02
4.4 hashing02
 
4.4 hashing
4.4 hashing4.4 hashing
4.4 hashing
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing ext
 
4.4 external hashing
4.4 external hashing4.4 external hashing
4.4 external hashing
 
4.2 bst
4.2 bst4.2 bst
4.2 bst
 

Kürzlich hochgeladen

BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxSayali Powar
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research DiscourseAnita GoswamiGiri
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...Nguyen Thanh Tu Collection
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
 
Employablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptxEmployablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptxryandux83rd
 
How to Uninstall a Module in Odoo 17 Using Command Line
How to Uninstall a Module in Odoo 17 Using Command LineHow to Uninstall a Module in Odoo 17 Using Command Line
How to Uninstall a Module in Odoo 17 Using Command LineCeline George
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseCeline George
 
How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17Celine George
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17Celine George
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Association for Project Management
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
An Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPAn Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPCeline George
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...Nguyen Thanh Tu Collection
 

Kürzlich hochgeladen (20)

BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
 
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research Discourse
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
Employablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptxEmployablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptx
 
How to Uninstall a Module in Odoo 17 Using Command Line
How to Uninstall a Module in Odoo 17 Using Command LineHow to Uninstall a Module in Odoo 17 Using Command Line
How to Uninstall a Module in Odoo 17 Using Command Line
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 Database
 
How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17How to Manage Buy 3 Get 1 Free in Odoo 17
How to Manage Buy 3 Get 1 Free in Odoo 17
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17
 
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
Spearman's correlation,Formula,Advantages,
Spearman's correlation,Formula,Advantages,Spearman's correlation,Formula,Advantages,
Spearman's correlation,Formula,Advantages,
 
An Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPAn Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERP
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
 

4.4 text mining

  • 2. Dimensionality Reduction for Text • Text-based queries can be represented as vectors, which can be used to search for their nearest neighbors in a document collection. • For any nontrivial document database, the number of terms T and the number of documents D are usually quite large. • Such high dimensionality leads to the problem of inefficient computation, since the resulting frequency table will have size T×D. • The high dimensionality also leads to very sparse vectors and increases the difficulty in detecting and exploiting the relationships among terms. 05/07/15 2TEXT MINING
  • 3. • Matrix and vector notations • In the following part, we use x1,…,xtnЄ Rm to represent the n documents with m features. • They can be represented as a term-document matrix X = [x1,x2….,xn]. 05/07/15 3TEXT MINING
  • 4. Latent Semantic Indexing • It is one of the most popular algorithms for document dimensionality reduction. • It is fundamentally based on SVD (singular value decomposition). • The rank of the term-document X is r, then LSI decomposes X using SVD as follows: X =U∑VT  where ∑ = diag(σ1,…. σr) and σ1≥ σ2 ≥.. ≥ σr are the singular values of X,  U =[a1,…,ar] and aiis called the left singular vector,  V = [v1,…,vr], and vi is called the right singular vector. 05/07/15 4TEXT MINING
  • 5. • LSI uses the first k vectors in U as the transformation matrix to embed the original documents into a k-dimensional subspace. • The basic idea of LSI is to extract the most representative features, and at the same time the reconstruction error can be minimized. 05/07/15 5TEXT MINING
  • 6. • Let a be the transformation vector. The objective function of LSI can be stated as follows: aopt = arg mina ||X−aaT X||2 = arg maxa aT XXT a, • with the constraint, aT a = 1. 05/07/15 6TEXT MINING
  • 7. Locality Preserving Indexing • It aims to extract the most discriminative features. • The basic idea of LPI is to preserve the locality information. • If two documents are near each other in the original document space, LPI tries to keep these two documents close together in the reduced dimensionality space. • Since the neighboring documents probably relate to the same topic, LPI is able to map the documents related to the same semantics as close to each other as possible. 05/07/15 7TEXT MINING
  • 8. • Given the document set x1,…,xnЄRm , LPI constructs a similarity matrix S Є Rn×n . • The transformation vectors of LPI can be obtained by solving the following minimization problem: • aopt = arg mina ∑i,j (aT xi−aT xj)2 Sij = arg minaaT XLXT a. • with the constraint, aT XDXT a = 1, • where L = D−S is the Graph Laplacian. 05/07/15 8TEXT MINING
  • 9. • LPI constructs the similarity matrix S as Sij=xt ixj/||xt ixj|| • if xi is among the p nearest neighbors of xj or xj is among the p nearest neighbors of xi • Sij=0; otherwise. • Thus, the objective function in LPI incurs a heavy penalty if neighboring points xi and xj are mapped far apart. 05/07/15 9TEXT MINING
  • 10. • LPI aims to discover the local geometrical structure of the document space. • Since the neighboring documents probably relate to the same topic, LPI can have more discriminating power than LSI. • Therefore, for document clustering and document classification, we might expect LPI to have better performance than LSI. 05/07/15 10TEXT MINING
  • 11. Probabilistic Latent Semantic Indexing • It achieves dimensionality reduction through a probabilistic mixture model. • Assume there are k latent common themes in the document collection, and each is characterized by a multinomial word distribution. • A document is regarded as a sample of a mixture model with these theme models as components. • We fit such a mixture model to all the documents, and the obtained k component multinomial models can be regarded as defining k new semantic dimensions. • The mixing weights of a document can be used as a new representation of the document in the low latent semantic dimensions. 05/07/15 11TEXT MINING
  • 12. • Formally, let C = {d1,d2,..,dn} be a collection of n documents. • Let θ1,.., θk be k theme multinomial distributions. • A word w in document di is regarded as a sample of the following mixture model. pdi (w) =∑j=1 k [∏di,j p(wθj)] • where ∏ di,j is a document-specific mixing weight for the j-th aspect theme, and ∑j=1 k , ∏di,j=1. 05/07/15 12TEXT MINING
  • 13. • The log-likelihood of the collection C is, log p(C ۸) =∑i=1 n ∑wЄV[c(w,di)log(∑j=1 k (∏di,jp(wθj)))], • where V is the set of all the words, • c(w,di) is the count of word w in document di, • And ۸=({θj;{∏di,j}n i=1}k j=1) is the set of all the theme model parameters. 05/07/15 13TEXT MINING
  • 14. Text Mining Approaches • The major approaches, based on the kinds of data they take as input, are (1)The keyword-based approach, where the input is a set of keywords or terms in the documents, (2)The tagging approach, where the input is a set of tags, and (3)The information-extraction approach, which inputs semantic information, such as events, facts, or entities uncovered by information extraction. 05/07/15 14TEXT MINING
  • 15. • A simple keyword-based approach may only discover relationships at a relatively shallow level. • It may not bring much deep understanding to the text. • The tagging approach may rely on tags obtained by manual tagging or by some automated categorization algorithm. • The information-extraction approach is more advanced and may lead to the discovery of some deep knowledge, but it requires semantic analysis of text by natural language understanding and machine learning methods. 05/07/15 15TEXT MINING
  • 16. Keyword-Based Association Analysis • It collects sets of keywords or terms that occur frequently together and then finds the association or correlation relationships among them. • It first preprocesses the text data by parsing, stemming, removing stop words, and so on, and then evokes association mining algorithms. • In a document database, each document can be viewed as a transaction, while a set of keywords in the document can be considered as a set of items. • That is, the database is in the format, {document_id,a_set_of_ keywords}. 05/07/15 16TEXT MINING
  • 17. • Notice that a set of frequently occurring consecutive or closely located keywords may form a term or a phrase. • It can help detect compound associations, domain-dependent terms or phrases, such as [Stanford, University] or [U.S., President, George W. Bush], or noncompound associations, such as [dollars, shares, exchange, total, commission, stake, securities]. 05/07/15 17TEXT MINING
  • 18. • Mining based on these associations is referred to as “term- level association mining”. • 2 advantages: (1)Terms and phrases are automatically tagged so there is no need for human effort in tagging documents; and (2)The number of meaningless results is greatly reduced, as is the execution time of the mining algorithms. 05/07/15 18TEXT MINING
  • 19. Document Classification Analysis • It deals with the existence of a tremendous number of on-line documents, it is tedious yet essential to be able to automatically organize such documents into classes. • Document classification has been used in (1)Automated topic tagging , (2)Topic directory construction, (3)Identification of the document writing styles (4)Classifying the purposes of hyperlinks. • Procedure:  First, a set of preclassified documents is taken as the training set.  The training set is then analyzed in order to derive a classification scheme.  The so-derived classification scheme can be used for classification of other on-line documents. 05/07/15 19TEXT MINING
  • 20. • This process appears similar to the classification of relational data. • For example, in the tuple fsunny; warm; dry; not windy; play tennisg, the value “sunny” corresponds to the attribute weather outlook, “warm” corresponds to the attribute temperature, and so on. • The classification analysis decides which set of attribute-value pairs has the greatest discriminating power in determining whether a person is going to play tennis. 05/07/15 20TEXT MINING
  • 21. • On the other hand, document databases are not structured according to attribute-value pairs. • That is, a set of keywords associated with a set of documents is not organized into a fixed set of attributes or dimensions. • Therefore, commonly used relational data-oriented classification methods, may not be effective. • A feature selection process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels. This will reduce the set of terms to be used in classification, thus improving both efficiency and accuracy. 05/07/15 21TEXT MINING
  • 22. • An association-based classification, classifies documents based on a set of associated, frequently occurring text patterns. • Notice that very frequent terms are likely poor discriminators. • Thus only those terms that are not very frequent and that have good discriminative power will be used in document classification. • First, keywords and terms can be extracted. • Second, concept hierarchies of keywords and terms can be obtained using available term classes such as WordNet, or relying on expert knowledge, or some keyword classification systems. 05/07/15 22TEXT MINING
  • 23. Document Clustering Analysis • It organizing documents in an unsupervised manner. • Due to the curse of dimensionality, it makes sense to first project the documents into a lowerdimensional subspace. • Then, the traditional clustering algorithms can be applied. • The mixture model clustering method models the text data with a mixture model, often Involving multinomial component models. 05/07/15 23TEXT MINING
  • 24. • Clustering involves two steps: (1) Estimating the model parameters based on the text data and any additional prior knowledge, and (2) Inferring the clusters based on the estimated model parameters. • ProbabilisticLatentSemanticAnalysis(PLSA) • LatentDirichletAllocation(LDA) • The clusters can be designed to facilitate comparative analysis of documents. 05/07/15 24TEXT MINING