SlideShare ist ein Scribd-Unternehmen logo
1 von 6
Downloaden Sie, um offline zu lesen
International Journal of Research in Computer Science
eISSN 2249-8265 Volume 2 Issue 4 (2012) pp. 7-12
© White Globe Publications
www.ijorcs.org


        PRIVACY PRESERVING MFI BASED SIMILARITY
         MEASURE FOR HIERARCHICAL DOCUMENT
                     CLUSTERING
                                      P. Rajesh1, G. Narasimha2, N.Saisumanth3
                             1,3
                                 Department of CSE, VVIT, Nambur, Andhra Pradesh, India
                                             Email: rajesh.pleti@gmail.com
                                        Email: saisumanth.nanduri@gmail.com
                           2
                             Department of CSE, JNTUH, Hyderabad, Andhra Pradesh, India
                                            Email: narasimha06@gmail.com

Abstract: The increasing nature of World Wide Web           navigation steps to find relevant documents. So we
has imposed great challenges for researchers in             need a hierarchical clustering that is relatively flat that
improving the search efficiency over the internet. Now      reduces the number of navigation steps. Therefore
days web document clustering has become an                  there is a great need for new document clustering
important research topic to provide most relevant           algorithms, which are more efficient than conventional
documents in huge volumes of results returned in            clustering algorithms [1, 2].
response to a simple query. In this paper, first we
                                                               The increasing nature of World Wide Web has
proposed a novel approach, to precisely define
                                                            imposed great challenges for researchers to cluster the
clusters based on maximal frequent item set (MFI) by
                                                            similar documents over the internet and their by
Apriori algorithm. Afterwards utilizing the same
                                                            improving the efficiency of search. Search engine uses
maximal frequent item set (MFI) based similarity
                                                            the getting more confused in selecting the relevant
measure for Hierarchical document clustering. By
                                                            documents among huge volumes of search results
considering maximal frequent item sets, the
                                                            returned to a simple query. A potential solution to this
dimensionality of document set is decreased. Secondly,
                                                            problem is to cluster the similar web documents, which
providing privacy preserving of open web documents
                                                            helps the user in identifying the relevant data easily
is to avoiding duplicate documents. There by we can
                                                            and effectively [3].
protect the privacy of individual copy rights of
documents. This can be achieved using equivalence              The outline of this paper is divided into six
relation.                                                   sections. section II, briefly discusses related work. We
                                                            explained our proposed algorithm description
Keywords: Maximal Frequent Item set, Apriori
                                                            including common preprocessing steps and pseudo
algorithm,    Hierarchical document clustering,
                                                            code of algorithm in section III. It also includes to
equivalence relation.
                                                            precisely defining clusters based on maximal frequent
                                                            item set (MFI) by Apriori algorithm. Section IV,
                 I. INTRODUCTION
                                                            describes exploiting the same maximal frequent item
    Document clustering has been studied intensively        set (MFI) based similarity measure for Hierarchical
because of its wide applicability in areas such as web      document clustering with running example. In section
mining, search engines, text mining and information         V, provides privacy preserving of open web
retrieval. The rapid progress of databases in every         documents by using equivalence relation to protect the
aspect of human actions has resulted in enormous            individual copy rights of a document.. Section VI,
demand for efficient algorithms for spinning data into      consists of conclusion and future scope.
valuable knowledge.
                                                                            II. RELATED WORK
   Document clustering has undergone through
various methods, still document clustering is in its            The related work of using maximal frequent item
inefficiency state for providing the required               set in web document clustering is explained as follows.
information needed by the user exactly and                  Ling Zhuang Honghua Dai [4] introduced a new
approximately. Suppose the user makes an incorrect          criterion to specifically locate the initial points using
selection while browsing the documents in hierarchy.        maximal frequent item set. These initial points are then
If user may not notice his mistakes until he browses        used as centers for k-means algorithm. However k-
into the deep portion of the hierarchy, then it decreases   means clustering is completely unstructured approach,
the efficiency of search and increases the number of        sensitive to noise and produces an unorganized


                                                                             www.ijorcs.org
8                                                                                  P. Rajesh, G. Narasimha, N.Saisumanth
collection of clusters that is not favorable to              based similarity measure . The clusters in the resulting
interpretation [5, 6]. To minimize the overlapping of        hierarchy are non-overlapping. The parent cluster
documents, Beil, Ester [7] were proposed a method            contains only the general documents.
HFTC (Hierarchical Frequent Text Clustering) is
another frequent item set based approach to choose the                 III. ALGORITHM DESCRIPTION
next frequent item sets. But the clustering result               In this section, we explained our proposed
depends on the order of choosing next frequent item          algorithm       description      including    common
sets. The resulting hierarchy in HFTC usually contains       preprocessing steps and pseudo code of algorithm. It
many clusters at first level. As a result the documents      also includes to precisely defining clusters based on
in the same class are to be distributed into different       maximal frequent item set (MFI) by Apriori algorithm.
branches of hierarchy, which decreases the overall           First, we will speak about some common
clustering accuracy.                                         preprocessing steps for representing each document by
   C.M.Fung [8] has introduced FIHC (Frequent Item           item sets (terms). Second we will bring in vector space
set based Hierarchical Clustering) method for                model by assigning weights to terms in all document
document clustering. Which employed, a cluster topic         sets. Finally, we will explain the process of
tree is constructed based on the similarity among            initialization of clusters seeds using MFI to perform
clusters. FIHC used the efficient child pruning when         hierarchical clustering. Let Ds represents set of all
number of clusters is large and to apply the elaborated      documents in collection of database.
sibling merging only when number of clusters is small.                 Ds= {d1, d2, d3………dM}: 1 ≤ i ≤ M
The experiment results FIHC actually outperforms all
other algorithms (bisecting-k means, UPGMA) in               A. Pre-Processing
accuracy for most number of clusters.                           The document set Ds is converted from
   The Apriori algorithm [9] is a well-known method          unstructured format into some common representation
for computing frequent item sets in a transaction            using the text preprocessing techniques, in which
database. The document under the same topic, shares          words or terms are extracted (tokenization). The input
more common frequent item sets (terms) than the              data set of documents in Ds are preprocessed using the
documents of different topics. The main advantage of         techniques namely, removing HTML tags first, after
using frequent item sets is that it can identify the         that apply stop words list and stemming algorithm.
relation among the more than two documents at a time            a) HTML Tags: parsing of HTML Tag
in a document collection unlike similarity measure              b) Stop words: Remove the stop words list like
between two documents [10, 11].By the means of                     “conjunctions, connectives, prepositions etc”
maximal frequent item sets, the dimensionality of the           c) Stemming algorithm: We utilize porter 2
document set is reduced. More over maximal frequent                stemmer algorithm in our approach.
item sets captures most related document sets. On the
other hand, hierarchical clustering most relevant for        B. Vector representation of document:
browsing and maps most specific documents to
generalized documents in the whole collection.                    Vector space model is the most commonly used
                                                             document representation model in text mining, web
   A conventional hierarchical clustering method             mining and information retrieval areas. In this model
constructs the hierarchy by subdividing parent cluster       each document is represented as n-dimensional term
or merging similar children clusters. It usually suffers     vector. The value of each term in the n-dimensional
from its inability to perform tuning once a merge or         vector reflects the importance of corresponding
split decision has been performed. This rigidity may         document. Let N be the total number of terms and M
lower the clustering accuracy. Furthermore, due to the       be the number of documents and each the document

                                                              𝐷 𝑖 = (𝑡𝑒𝑟𝑚 𝑖1 , 𝑡𝑒𝑟𝑚 𝑖2 , … … … … . . 𝑡𝑒𝑟𝑚 𝑖𝑛 ) 1≤ i≤ M. Where
fact that a parent cluster in the hierarchy always           can                 be                   denoted               as

                                                              𝑑𝑓(𝑡𝑒𝑟𝑚 𝑖𝑗 ) < 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑
contains all objects of its Childs, this kind of hierarchy

                                                             frequency 𝑡𝑒𝑟𝑚 𝑖𝑗 is less than the threshold value is
is not suitable for browsing. The user may have                                                value.        The     document
difficulty to locate his intention object in such a large
cluster.                                                     considered to avoid the problem of more times a term
   Our hierarchical clustering method is completely          appears throughout all documents in the whole
different. The aim of this paper is, first we form all       collection, the more poorly it discriminates between
the clusters by assigning documents to the most similar      documents [12].Calculate term frequency tf is number
cluster using maximal frequent item sets by Apriori          of times a term appears in a document. Document
                                                             frequency of a term df as no of documents that

                                                             documents vectors. 𝐷 𝑖 = (𝑤 𝑖1 , 𝑤12 , 𝑤13 , … … . . , 𝑤1 𝑖𝑛 )
algorithm       and then construct the hierarchical
document clustering based on their inter-cluster             contains term. Also construct the weights for

                                                                            Where 𝑤 𝑖𝑗 = 𝑡𝑓𝑖𝑗 ∗ 𝐼𝐷𝑓(𝑗) and
similarities via same maximal frequent item set (MFI)



                                                                                www.ijorcs.org
IDf (j) =𝑙𝑜𝑔 �            �1≤j≤n.where IDf is the inverse
                    𝑚
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering                                           9

                   𝑑𝑓 𝑗
                                                                 A frequent item set is a set of words which occurs
                                                             frequently together and are good candidates for


                                                             such that X ⊂ X1 and t(X) = t(X1), where t(X) defined
 document frequency.                                         clusters and are denoted by FI. An item set X is closed
Table 1: Table Representation of Transactional Database of   if there does not exist an item set X1 such that X1,
                       Documents
                                                             as the set of transactions that contain item set X and it
Terms      Doc 1          Doc 2   Doc 3   .....   Doc 4
                                                             is denoted by FCI(frequently closed items).If X is
Java       1              1       0       .....   1
                                                             frequent and no superset of X is frequent among the
Beans      0              1       0       .....   0


                                                             MFI. Then MFI⊂ FCI ⊂ FI Whenever there are very
                                                             set of items I in transactional databases. Then we say
.....      .....          …..     …..     .....   …..        that X is maximal frequent item set and denoted by
Servlets   1              0       1       .....   1

   By the representation of document as vector form,         long patterns are present in the data it is often
we can easily identify which documents Contains the          impractical to generate the entire set if frequent item
same features .The more features documents have in           sets or closed item sets [16]. In that case, maximal
common, the more related they are. Thus, it is realistic     frequent item sets are adequate for such applications.
to find well related documents. Assume that each             We employed maximal frequent item set algorithm
document is an item in the transactional database; each      from [17] using apriori. These maximal frequent item
term corresponds to a transaction. Our aim is to search      sets are initial seeds for hierarchical document
for highly related documents “appearing” together            clustering.
with same features (the documents whose MFI features         D. Pseudo code Algorithm
are closed). Similarly, the maximal frequent item set
discovery in the transaction database serves the               For MFI Based Similarity Measure for Hierarchical
purpose of finding items of documents appearing              Document Clustering
together in many transactions. i.e., document sets           Input: Document set Ds.
which have large amount of feature in common.
                                                             Definition: MFI: Maximal Frequent Item set.
C. Apriori for maximal frequent item sets
                                                             (tf) Term frequency and (df) document frequency
    Mining frequent item sets is a primary content of
                                                             Step 1. For  each document in Ds, Remove the HTML
data mining that emphasizes particularly in finding the
relation of different items in the large database. Mining            tags and perform stop word list and stemming.
                                                             Step 2. Calculate the term frequency (tf) and document


                                                              𝐷 𝑖 = (𝑡𝑒𝑟𝑚 𝑖1 , 𝑡𝑒𝑟𝑚 𝑖2 , … … … … . . 𝑡𝑒𝑟𝑚 𝑖𝑛 ) 1≤i≤M
frequent patterns is crucial problem in many data
mining applications such as the discovery of                         frequency (df).


                                                             Where df�𝑡𝑒𝑟𝑚 𝑖𝑗 � < Threshold value
association rules, correlations, multidimensional
patterns, and other numerous important inferring
patterns from consumer market basket analysis and
web access etc. The association mining problem is            Step 3. Also  construct the weighted document vectors

                                                             𝐷 𝑖 = (𝑤 𝑖1 , 𝑤12 , 𝑤13 , … … . . , 𝑤1 𝑖𝑛 )           𝑤 𝑖𝑗 = 𝑡𝑓𝑖𝑗 ∗
formulated as follows: Given a large data base of set of            for all the documents


                                                             𝐼𝐷𝑓(𝑗).Idf (j) =𝑙𝑜𝑔 �              � 1≤j≤n.
items transactions, find all frequent item sets, where a
                                                                                          𝑚
                                                                                                           Where
frequent item set is one that occurs in at least a user-
                                                                                         𝑑𝑓 𝑗
specified threshold value of the data base. Many of the
proposed item set mining algorithms are a variant of
                                                             Step 4. Now  represent each documents by keywords
Apriori, which employs a bottom-up, breadth first
                                                                   whose tf>support
search that enumerates every single frequent item set.


                                                              𝑀𝐹𝐼 = { 𝐹1 , 𝐹2 , 𝐹3 , … … … … . . 𝐹 𝑛 }
Apriori is a conventional algorithm that was first           Calculate the Maximal Frequent Item set(MFI) of
introduced] for mining association rules. Association        terms           using            Apriori  algorithm

                                                              Where each 𝐹𝑖 = { 𝑑1 , 𝑑2 , 𝑑3 , … … … 𝑑 𝑘 }
can be viewed as two-step process as


                                                                        a document 𝑑 𝑖 is in more than one maximal
                                                                     frequent item set then choose 𝐼 𝑑 as a set
   (1) Identifying all frequent item sets
                                                             Step 5. If
   (2) Generating strong association rules from the


                                                                     containing document 𝑑 𝑖 . Then Assign𝐼 𝑥 =𝐼 𝑑0 .For
      frequent item sets
                                                                     consisting of such maximal frequent item sets
   At first, candidate item sets are generated and

                                                                     the document 𝑑 𝑖
afterwards frequent item sets are mined with the help                each the maximal frequent item sets containing

                                                                  𝐼𝑓 [ 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑥 , 𝑑 𝑖 ))
of these candidate item sets. In the proposed approach,

                                                                                  > 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑑𝑖 , 𝑑 𝑖 ))]
we have used only the frequent item sets for further
processing so that, we undergone only the first step
(generation of maximal frequent item sets) of the
Apriori algorithm.


                                                                                   www.ijorcs.org
Then assign 𝐼 𝑥 = 𝐼 𝑑𝑖 .Assign the document 𝑑 𝑖 to 𝐼 𝑥           𝐹𝑖 𝑙𝑖𝑘𝑒 𝐹3 = { 𝑑1 , 𝑑5 , 𝑑7 } as one cluster in hierarchy
10                                                                                          P. Rajesh, G. Narasimha, N.Saisumanth


and discard 𝑑 𝑖 for other maximal frequent item sets.
                                                                         Case 3: If 𝐹𝑖 , 𝐹𝑗 contains some same documents
                                                                   and represent it by center (as in step6).
Repeat this process for all documents that occurs in


                                                                   consider the case of document 𝑑2 is repeatedin more
more than one maximal frequent item set


        these maximal frequent item sets 𝐹𝑖 as clusters            than one maximal frequent item sets{𝐹1 𝐹4 }.Similarly
                                                                   among the documents list obtained from MFI. Let us


        and combine the documents in 𝐹𝑖 into a single
Step 6. Apply  hierarchical document clustering to make

                                                                     𝑑4 is repeated in{ 𝐹1 , 𝐹2 , 𝐹4 }. Then choose𝐼 𝑑 =
                                                                   { 𝐹1 , 𝐹2 , 𝐹4 } = { 𝐼 𝑑0 , 𝐼 𝑑1 , 𝐼 𝑑2 }for    document𝑑4 .Assign
                                                                     𝐼 𝑥 =𝐼 𝑑0 = 𝐹1 . For each the maximal frequent item sets
        new document and represent it by centers of the

                                                                              𝐼 𝑑 containing                              𝑑4
        maximal frequent item sets. These are obtained

                                                                    𝐼 𝑑0 𝑡𝑜 𝐼 𝑑2 calculate the measure
        by combining the features of maximal frequent              in                                the       document         from


                                                                       𝐼𝑓 [ 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑥 , 𝑑4 ))
        item set of terms that grouping the documents


                                                                                      > 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑑𝑖 , 𝑑4 ))]
Step 7. Repeat     the same process of hierarchical
        document clustering based on maximal frequent


                                                                   document 𝑑4 closest to which maximal frequent item
        item sets for all levels in hierarchy and stop if
        total number of documents equals to one else go               By using this jaccards measure, we can identify the


                                                                   document 𝑑4 .Then assign 𝐼 𝑥 = 𝐼 𝑑𝑖 .
        to step 4.
                                                                   set among maximal frequent item sets containing the


                                                                         Let’s suppose that 𝑑4 is closed to the maximal
     IV. HIERARCHICAL CLUSTERS BASED ON


                                                                   frequent item set 𝐹4 . Assign the document𝑑4 to𝐼 𝑥 =
           MAXIMAL FREQUENT ITEM SETS


                                                                    𝐼 𝑑𝑖 = 𝐹4 and discard 𝑑4 for other maximal frequent
   After finding maximal frequent item sets (MFI) by
using Apriori algorithm. We turn to describing the

                                                                   exactly one cluster. Similarly 𝑑2 belongs to𝐹1 .Repeat
creation of hierarchical document clustering using                 item sets. After this step, each document belongs to
same similarity measure by MFI. A simple instance


among the whole collection of documents 𝐷 𝑆 by
case of example is also provided to demonstrate the

                                                                     𝑑2 , 𝑑4 are repeated in𝐹1 , 𝐹4 . The clusters that will form
                                                                   this process for all documents that occurs in more than


apriorialgorithm are 𝑀𝐹𝐼 = { 𝐹1 , 𝐹2 , 𝐹3 … . . 𝐹 𝑛 }.Where
entire process. The set of maximal frequent item sets              one maximal frequent item set. Since the documents



by𝐹𝑖 = { 𝑑1 , 𝑑2 , 𝑑3 … . . 𝑑 𝑘 }.Then consider total number
                                                                   at the first level of hierarchy by applying step5 and


                                                                        𝐹1 = {𝑑2 , 𝑑6 }
each MFI consist of set of documents represented                   step 6 are as follows.


                                                                        𝐹2 = {𝑑3 , , 𝑑8 }
of documents which occurs in maximal frequent item

                        𝑑1 , 𝑑2 , 𝑑3, 𝑑4 , 𝑑5 , 𝑑6 , 𝑑7 , 𝑑8 ,          𝐹3 = { 𝑑1 , 𝑑5 , 𝑑7 }
sets in MFI as follows.

           𝑀𝐹𝐼 = �                                             �
                      𝑑9 , 𝑑10 , 𝑑11 , 𝑑12 , 𝑑13 , 𝑑14 , 𝑑15
                                                                        𝐹4 = {𝑑4 , , 𝑑14 }
     𝐹1 = {𝑑2 , 𝑑4 , 𝑑6 }
                                                                        𝐹5 = {𝑑10 , 𝑑12 , 𝑑15 }
     𝐹2 = {𝑑3 , 𝑑4 , 𝑑8 }
                                                                        𝐹6 = {𝑑9 , 𝑑11 , 𝑑13 }
     𝐹3 = { 𝑑1 , 𝑑5 , 𝑑7 }
     𝐹4 = {𝑑4 , 𝑑2 , 𝑑14 }
                                                                      The hierarchical diagram for the above form of

     𝐹5 = {𝑑10 , 𝑑12 , 𝑑15 }
                                                                   maximal frequent item set clusters can be representing


     𝐹6 = {𝑑9 , 𝑑11 , 𝑑13 }
                                                                   as follows. Repeat the same process of hierarchical
                                                                   document clustering based on maximal frequent item
                                                                   sets for all levels in hierarchy and stop if total number
   The clusters in the resulting hierarchy are non-                of documents equals to one else go to step 4.
overlapping. This can be achieved through the


    Case1: If 𝐹𝑖 , 𝐹𝑗 are same then choose one in random
following cases.




   Case2: If 𝐹𝑖 , 𝐹𝑗 are different then form clusters of
to form cluster.


documents contained in𝐹𝑖 , 𝐹𝑗 independently. In our

in 𝐹3 , 𝐹5 and 𝐹6 𝑎𝑟𝑒 different. So we form a clusters
example, the maximal frequent item set of documents

according to the documents contained in
                                                                      Figure 1: Hierarchical document clustering using MFI



                                                                                       www.ijorcs.org
Represent each new document �𝐿 𝑖𝑗 � in hierarchy by
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering                                                11
                                                               itself. When we are classifying the documents into
maximal frequent item set of terms as centers (as in           equivalence classes, we are not considering these ones
step 6).These maximal frequent item sets are obtained          and put zeros. Jaccard similarity coefficient matrix for
by combining the features of maximal frequent item             four documents can be represented as follows.
set of terms that grouping the documents. Each new
                                                                                     d1      d2     d3     d4

�𝐿 𝑖𝑗 � represents that jth document in the level of
document also consisting of corresponding updated
weights of maximal frequent item set of terms. Where                        d 1  1 0.4 0.8 0.5

hierarchy𝐿 𝑖 . In the figure { 𝐿12 = 𝐿21 }means that the
                                                                            d 2 0.4 1 0.8 0.4
                                                                        Rα =                  

level 𝐿1 are not matched with other documents MFI set
                                                                            d 3 0.8 0.8 1 0.9
maximal frequent item set of terms in 2nd document of                                         
                                                                            d 4 0.5 0.4 0.9 1 
in same level𝐿1 .So it is repeated same for the next
level and it is also same for the document { 𝐿13 =             Ds = {d1 , d2 , d3 , d4 }as the collectionof document pairs
 𝐿22 }. The documents{ 𝐿11 , 𝐿15 } and{ 𝐿14 , 𝐿16 } in first
                                                                  Where alpha is threshold. Let define a relation R on


                                                               value. i.e 𝑅 = {(𝑑 𝑖 , 𝑑 𝑗 )/ 𝐽 (𝑑 𝑖 , 𝑑 𝑗 ) ≥ 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 }
                                                               whose similarity measure is above some threshold


level as 𝐿23 , 𝐿24 .
level are combined using MFI based hierarchical

                                                                1. R is reflexive on Ds iff 𝑅 (𝑑 𝑖 , 𝑑 𝑖 ) = 1. i.e Every
clustering and represent these documents in the second



                                                                2. R is symmetric on Ds iff𝑅 �𝑑 𝑖 , 𝑑 𝑗 � = 𝑅 �𝑑 𝑗 , 𝑑 𝑖 �i.e
                                                                   document is mostly related to itself.

                                                                   if the document 𝑑 𝑖 is similar to 𝑑 𝑗 then the
       V. PRIVACY PRESERVING OF WEB


                                                                   document 𝑑 𝑗 is also similar to𝑑 𝑖 .
        DOCUMENTS USING EQUIVALENCE
                  RELATION
   Most internet web documents are publicly available

                                                                   𝑅 (𝑑 𝑖 , 𝑑 𝑘 ) ≥ 𝑚𝑎𝑥 𝑗 { min{𝑅 �𝑑 𝑖 , 𝑑 𝑗 �, 𝑅 �𝑑 𝑗 , 𝑑 𝑖 �}}.
for providing services required by the user. In such            3. R is transitive on Ds iff
documents there is no confidential or sensitive data
(open to all). Then how can we provide privacy of
such documents. Now a days, same information will              Then R is transitive by the definition.
be exists in more than one document in duplicate
                                                                   Then R is an equivalence relation on Ds, which
forms. The way of providing privacy preserving of
                                                               partitions the input document set Ds into set of
documents is by avoiding duplicate documents. There
                                                               equivalence classes. Equivalence relation seems a
by we can protect the privacy of individual copy rights
                                                               natural          technique    for    duplicate    document
of documents. Many duplicate document detection
                                                               categorization. Any two documents in same
techniques are available such as syntactic, URL based,
                                                               equivalence class are related and are different if they
semantic approaches. In each technique, a processing
                                                               are coming from two equivalence classes. The set of
overhead of maintaining shingling’s, signatures,
                                                               all equivalence classes induces the document set Ds.
fingerprints [13, 14, 15, 18]. In this paper, we
                                                               High syntactic similarity pairs of documents typically
proposed a new technique for avoiding duplicate
                                                               referred to as duplicates or near duplicates except
documents using equivalence relation. Let Ds be the
                                                               diagonal elements. By using equivalence relation,
input duplicate document set is subset to web
                                                               easily we can identify the duplicate documents or we
document collection. First find the jaccard similarity
                                                               can perform the clustering on duplicate documents.
measure for every pair of documents in Ds using
                                                               Apart from the representation of feature document
weighted feature representation of maximal frequent
                                                               vector by MFI, we also need to consider that who is
item sets discussed in step 2 and step 3 in algorithm. If
                                                               the author of document, when the document was
the similarity measure of two documents is equal to 1,
                                                               created, where it is available, helps in effectively
then the two documents are most similar. If the
                                                               finding the duplicate documents. Each document in
measure is 0, then they are not duplicates. The Jaccard
                                                               input Ds must belong to unique equivalence class. If R
index or the Jaccard similarity coefficient is a
                                                               is equivalence relation on Ds = {d1, d2, d3, d4 …..dn}.
statistical measure of similarity between sample sets.
                                                               Then number of equivalence relations on Ds is always
For two sets, it is denoted as the cardinality of their
                                                               lies between n ≤ | R|≤ n2. i.e the time complexity of
intersection divided by the cardinality of their union.


                                 |𝑑1 ∩ 𝑑2 |
                                                               calculating equivalence relation on Ds is O(n2).

                                                               .i.e𝐽 �𝑑 𝑖 , 𝑑 𝑗 � ≥ 0.8. Since the matrix is symmetric, the
Mathematically

                 𝐽(𝑑1 , 𝑑2 ) =
                                                               Choose the threshold α in equivalence relation as 0.8

                                 |𝑑1 ∩ 𝑑2 |                    documents sets {(𝑑3 , 𝑑1 ), (𝑑3 , 𝑑2 ), (𝑑4 , 𝑑3 )}      are
                                                               mostly related. Hence the documents are near
   For every pair of two documents calculate jaccard           duplicates and grouping the documents into clusters
measure of d1, d2.All the diagonal elements in matrix          thereby providing privacy of individual copy rights of
are ones, because every document mostly related to             documents.


                                                                                  www.ijorcs.org
12                                                                                   P. Rajesh, G. Narasimha, N.Saisumanth

                          0    0    1   0                          Data mining 2002 (KDD-2002), Edmonton, Alberta,
                          0    0    1   0
                                                                     Canada.
                  R 0.8 =                                     [8] BenjaminFung, C.M., Wang, Ke., Ester, Martin. (2003).
                          1    1    0   1                          “Hierarchical Document Clustering using Frequent Item
                                                                   Sets”. In Proceedings SIAM International Conference
                          0    0    1   0                          on Data Mining 2003 (SIAM DM-2003), pp:59-70.
                                                                [9] Agrawal, R., Srikant, R. (1994). “Fast Algorithms for
      VI. CONCLUSION AND FUTURE SCOPE                                Mining Association Rules”. In the Proceedings of 20th
                                                                     International Conference on Very Large Data Bases,
   Cluster analysis can be used as powerful ,stranded
                                                                     1994, Santiago, Chile, PP: 487-499.
alone data mining concept that gains insight
                                                                [10] Liu, W.L., and Zeng, X.S. (2005). “Document
information of knowledge from huge unstructured
                                                                     Clustering Based on Frequent Term Sets”. Proceedings
databases. Most conventional clustering methods do                   of Intelligent Systems and Control, 2005.
not satisfy the document clustering requirements such
                                                                [11] Zamir, O., Etzioni, O. (1998). “Web Document
as high dimensionality, huge volumes and easy of
                                                                     Clustering: A Feasibility Demonstration”. In the
accessing meaningful clusters labels. In this paper, we              Proceedings of ACM,1998 (SIGIR-98), PP: 46-54.
presented novel approach; Maximal frequent item set
                                                                [12] Kjersti, (1997). “A Survey on Personalized Information
(MFI) Based Similarity Measure for Hierarchical
                                                                     Filtering Systems for the World Wide Web”. Technical
Document Clustering to address these issues.                         Report 922, Norwegian Computing Center, 1997.
Dimensionality reduction can be achieved through
                                                                [13] Prasannakumar, J., Govindarajulu, P., “Duplicate and
MFI. By using the same MFI similarity measure in                     Near Duplicate Documents Detection: A Review”.
hierarchal document clustering, the number of levels                 European Journal of Scientific Research ISSN 1450-
will be decreased. It is easy for browsing. Clustering               216X Vol.32 No.4 ,2009, pp:514-527
has its paths in many areas, by applying MFI based              [14] Syed Mudhasir,Y., Deepika,J., “Near Duplicate
techniques to clusters, including data mining, statistics,           Detection and Elimination Based on Web Provenance
biology, and machine learning we can get the high                    for Efficient Web Search”. In the Proceedings of
quality of clusters. Moreover, by means of maximal                   International Journal on Internet and Distributed
frequent item sets, we can predict the most influenced               Computing Systems, Vol.1, No.1, 2011.
objects of clusters in the entire dataset of applications       [15] Alsulami, B.S., Abulkhair, F., Essa, E., “Near Duplicate
like business, marketing, world wide web, social                     Document Detection Survey”. In the Proceedings of
networking analysis.                                                 International Journal of Computer Science and
                                                                     Communications Networks, Vol.2, N0.2, pp:147-151.
                 VII. REFEERENCES                               [16] Doug Burdick, Manuel Calimlim, Johannes Gehrke.
                                                                     (2001). “A Maximal Frequent Itemset Algorithm for
[1] Ruxixu, Donald Wunsch., “A Survey of Clustering
                                                                     Transactional Databases”. In the Proceedings of ICDE,
      Algorithms”. In the Proceedings of IEEE Transactions
                                                                     17th International Conference on Data Engineering
      on Neural Networks, Vol. 16, No. 3, May 2005.
                                                                     2001 (ICDE-2001).
[2]   Jain, A.K., Murty, M.N., Flynn, P.J., “Data Clustering:
                                                                [17] Murali Krishna, S., Durga Bhavani, S., “An Efficient
      A Review”. In the Proceedings of ACM Computing
                                                                     Approach for Text Clustering Based On Frequent Item
      Surveys, Vol.31, No.3, 1999, pp: 264-323.
                                                                     Sets”. European Journal of Scientific Research ISSN
[3]   Kleinberg, J.M., “Authoritative Sources in a                   1450-216X, Vol.42, No.3, 2010, pp:399-410.
      Hyperlinked Environment”. In the Journal of the ACM,
                                                                [18] Lopresti, D.P. (1999). "Models and Algorithms for
      Vol. 46, No.5, 1999, pp: 604-632.
                                                                     Duplicate Document Detection". In the Proceedings of
[4]   Ling Zhuang, Honghua Dai. (2004). “A Maximal                   Fifth International Conference on Document Analysis
      Frequent Item Set Approach for Web Document                    and Recognition 1999 (ICDAR-1999), 20th-22th Sep,
      Clustering”. In Proceedings of the IEEE Fourth                 pp:297-300.
      International Conference on Computer and Information
      Technology 2004 (CIT-2004).
[5]   Michael, W., Trosset. (2008). “Representing Clusters:
      k-Means Clustering, Self-Organizing Maps and
      Multidimensional      Scaling”.   Technical    Report,
      Department      of   Statistics,  Indian    University,
      Bloomington, 2008.
[6]   Michael Steinbach, George karypis, and Vipinkumar.
      (2000). “A Comparison of Document Clustering
      Techniques”. In Proceedings of the Workshop on Text
      Mining, 2000 (KDD-2000), Boston, pp: 109-111.
[7]   Beil, F., Ester, M., Xu, X. (2002). “Frequent Term-
      Based Text Clustering”. In Proceedings of 8th
      International Conference on Knowledge Discovery and



                                                                                  www.ijorcs.org

Weitere ähnliche Inhalte

Was ist angesagt?

AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
csandit
 
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
IJCSIS Research Publications
 

Was ist angesagt? (14)

Classification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithmClassification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithm
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGA SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
 
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSCONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
 
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
 
Sub1522
Sub1522Sub1522
Sub1522
 
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
 
O NTOLOGY B ASED D OCUMENT C LUSTERING U SING M AP R EDUCE
O NTOLOGY B ASED D OCUMENT C LUSTERING U SING M AP R EDUCE O NTOLOGY B ASED D OCUMENT C LUSTERING U SING M AP R EDUCE
O NTOLOGY B ASED D OCUMENT C LUSTERING U SING M AP R EDUCE
 
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
 
Recent Trends in Incremental Clustering: A Review
Recent Trends in Incremental Clustering: A ReviewRecent Trends in Incremental Clustering: A Review
Recent Trends in Incremental Clustering: A Review
 
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGYINTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
 
I6 mala3 sowmya
I6 mala3 sowmyaI6 mala3 sowmya
I6 mala3 sowmya
 
Efficient Similarity Search Over Encrypted Data
Efficient Similarity Search Over Encrypted DataEfficient Similarity Search Over Encrypted Data
Efficient Similarity Search Over Encrypted Data
 
Semantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionSemantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detection
 

Ähnlich wie Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering

AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERINGAN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
International Journal of Technical Research & Application
 
Paper id 37201536
Paper id 37201536Paper id 37201536
Paper id 37201536
IJRAT
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
Editor IJARCET
 
Bs31267274
Bs31267274Bs31267274
Bs31267274
IJMER
 

Ähnlich wie Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering (20)

AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERINGAN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
 
Paper id 37201536
Paper id 37201536Paper id 37201536
Paper id 37201536
 
Hierarchal clustering and similarity measures along with multi representation
Hierarchal clustering and similarity measures along with multi representationHierarchal clustering and similarity measures along with multi representation
Hierarchal clustering and similarity measures along with multi representation
 
Hierarchal clustering and similarity measures along
Hierarchal clustering and similarity measures alongHierarchal clustering and similarity measures along
Hierarchal clustering and similarity measures along
 
Information Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachInformation Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis Approach
 
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACHINFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
 
J017145559
J017145559J017145559
J017145559
 
Challenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document ClusteringChallenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document Clustering
 
600 608
600 608600 608
600 608
 
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
 
IRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET- Concept Extraction from Ambiguous Text Document using K-MeansIRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET- Concept Extraction from Ambiguous Text Document using K-Means
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Bl24409420
Bl24409420Bl24409420
Bl24409420
 
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
 
IRJET- Proficient Recovery Over Records using Encryption in Cloud Computing
IRJET- Proficient Recovery Over Records using Encryption in Cloud ComputingIRJET- Proficient Recovery Over Records using Encryption in Cloud Computing
IRJET- Proficient Recovery Over Records using Encryption in Cloud Computing
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
 
Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach
 
Bs31267274
Bs31267274Bs31267274
Bs31267274
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
653 656
653 656653 656
653 656
 

Mehr von IJORCS

Help the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on IntersectionsHelp the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on Intersections
IJORCS
 
Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...
IJORCS
 
A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...
IJORCS
 

Mehr von IJORCS (20)

Help the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on IntersectionsHelp the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on Intersections
 
Call for Papers - IJORCS, Volume 4 Issue 4
Call for Papers - IJORCS, Volume 4 Issue 4Call for Papers - IJORCS, Volume 4 Issue 4
Call for Papers - IJORCS, Volume 4 Issue 4
 
Real-Time Multiple License Plate Recognition System
Real-Time Multiple License Plate Recognition SystemReal-Time Multiple License Plate Recognition System
Real-Time Multiple License Plate Recognition System
 
FPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A RetrospectiveFPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
 
Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...
 
Algebraic Fault Attack on the SHA-256 Compression Function
Algebraic Fault Attack on the SHA-256 Compression FunctionAlgebraic Fault Attack on the SHA-256 Compression Function
Algebraic Fault Attack on the SHA-256 Compression Function
 
Enhancement of DES Algorithm with Multi State Logic
Enhancement of DES Algorithm with Multi State LogicEnhancement of DES Algorithm with Multi State Logic
Enhancement of DES Algorithm with Multi State Logic
 
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
 
CFP. IJORCS, Volume 4 - Issue2
CFP. IJORCS, Volume 4 - Issue2CFP. IJORCS, Volume 4 - Issue2
CFP. IJORCS, Volume 4 - Issue2
 
Call for Papers - IJORCS - Vol 4, Issue 1
Call for Papers - IJORCS - Vol 4, Issue 1Call for Papers - IJORCS - Vol 4, Issue 1
Call for Papers - IJORCS - Vol 4, Issue 1
 
Voice Recognition System using Template Matching
Voice Recognition System using Template MatchingVoice Recognition System using Template Matching
Voice Recognition System using Template Matching
 
Channel Aware Mac Protocol for Maximizing Throughput and Fairness
Channel Aware Mac Protocol for Maximizing Throughput and FairnessChannel Aware Mac Protocol for Maximizing Throughput and Fairness
Channel Aware Mac Protocol for Maximizing Throughput and Fairness
 
A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...
 
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
 
A Study of Routing Techniques in Intermittently Connected MANETs
A Study of Routing Techniques in Intermittently Connected MANETsA Study of Routing Techniques in Intermittently Connected MANETs
A Study of Routing Techniques in Intermittently Connected MANETs
 
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
 
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed SystemAn Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
 
The Design of Cognitive Social Simulation Framework using Statistical Methodo...
The Design of Cognitive Social Simulation Framework using Statistical Methodo...The Design of Cognitive Social Simulation Framework using Statistical Methodo...
The Design of Cognitive Social Simulation Framework using Statistical Methodo...
 
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
 

Kürzlich hochgeladen

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Kürzlich hochgeladen (20)

Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 

Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering

  • 1. International Journal of Research in Computer Science eISSN 2249-8265 Volume 2 Issue 4 (2012) pp. 7-12 © White Globe Publications www.ijorcs.org PRIVACY PRESERVING MFI BASED SIMILARITY MEASURE FOR HIERARCHICAL DOCUMENT CLUSTERING P. Rajesh1, G. Narasimha2, N.Saisumanth3 1,3 Department of CSE, VVIT, Nambur, Andhra Pradesh, India Email: rajesh.pleti@gmail.com Email: saisumanth.nanduri@gmail.com 2 Department of CSE, JNTUH, Hyderabad, Andhra Pradesh, India Email: narasimha06@gmail.com Abstract: The increasing nature of World Wide Web navigation steps to find relevant documents. So we has imposed great challenges for researchers in need a hierarchical clustering that is relatively flat that improving the search efficiency over the internet. Now reduces the number of navigation steps. Therefore days web document clustering has become an there is a great need for new document clustering important research topic to provide most relevant algorithms, which are more efficient than conventional documents in huge volumes of results returned in clustering algorithms [1, 2]. response to a simple query. In this paper, first we The increasing nature of World Wide Web has proposed a novel approach, to precisely define imposed great challenges for researchers to cluster the clusters based on maximal frequent item set (MFI) by similar documents over the internet and their by Apriori algorithm. Afterwards utilizing the same improving the efficiency of search. Search engine uses maximal frequent item set (MFI) based similarity the getting more confused in selecting the relevant measure for Hierarchical document clustering. By documents among huge volumes of search results considering maximal frequent item sets, the returned to a simple query. A potential solution to this dimensionality of document set is decreased. Secondly, problem is to cluster the similar web documents, which providing privacy preserving of open web documents helps the user in identifying the relevant data easily is to avoiding duplicate documents. There by we can and effectively [3]. protect the privacy of individual copy rights of documents. This can be achieved using equivalence The outline of this paper is divided into six relation. sections. section II, briefly discusses related work. We explained our proposed algorithm description Keywords: Maximal Frequent Item set, Apriori including common preprocessing steps and pseudo algorithm, Hierarchical document clustering, code of algorithm in section III. It also includes to equivalence relation. precisely defining clusters based on maximal frequent item set (MFI) by Apriori algorithm. Section IV, I. INTRODUCTION describes exploiting the same maximal frequent item Document clustering has been studied intensively set (MFI) based similarity measure for Hierarchical because of its wide applicability in areas such as web document clustering with running example. In section mining, search engines, text mining and information V, provides privacy preserving of open web retrieval. The rapid progress of databases in every documents by using equivalence relation to protect the aspect of human actions has resulted in enormous individual copy rights of a document.. Section VI, demand for efficient algorithms for spinning data into consists of conclusion and future scope. valuable knowledge. II. RELATED WORK Document clustering has undergone through various methods, still document clustering is in its The related work of using maximal frequent item inefficiency state for providing the required set in web document clustering is explained as follows. information needed by the user exactly and Ling Zhuang Honghua Dai [4] introduced a new approximately. Suppose the user makes an incorrect criterion to specifically locate the initial points using selection while browsing the documents in hierarchy. maximal frequent item set. These initial points are then If user may not notice his mistakes until he browses used as centers for k-means algorithm. However k- into the deep portion of the hierarchy, then it decreases means clustering is completely unstructured approach, the efficiency of search and increases the number of sensitive to noise and produces an unorganized www.ijorcs.org
  • 2. 8 P. Rajesh, G. Narasimha, N.Saisumanth collection of clusters that is not favorable to based similarity measure . The clusters in the resulting interpretation [5, 6]. To minimize the overlapping of hierarchy are non-overlapping. The parent cluster documents, Beil, Ester [7] were proposed a method contains only the general documents. HFTC (Hierarchical Frequent Text Clustering) is another frequent item set based approach to choose the III. ALGORITHM DESCRIPTION next frequent item sets. But the clustering result In this section, we explained our proposed depends on the order of choosing next frequent item algorithm description including common sets. The resulting hierarchy in HFTC usually contains preprocessing steps and pseudo code of algorithm. It many clusters at first level. As a result the documents also includes to precisely defining clusters based on in the same class are to be distributed into different maximal frequent item set (MFI) by Apriori algorithm. branches of hierarchy, which decreases the overall First, we will speak about some common clustering accuracy. preprocessing steps for representing each document by C.M.Fung [8] has introduced FIHC (Frequent Item item sets (terms). Second we will bring in vector space set based Hierarchical Clustering) method for model by assigning weights to terms in all document document clustering. Which employed, a cluster topic sets. Finally, we will explain the process of tree is constructed based on the similarity among initialization of clusters seeds using MFI to perform clusters. FIHC used the efficient child pruning when hierarchical clustering. Let Ds represents set of all number of clusters is large and to apply the elaborated documents in collection of database. sibling merging only when number of clusters is small. Ds= {d1, d2, d3………dM}: 1 ≤ i ≤ M The experiment results FIHC actually outperforms all other algorithms (bisecting-k means, UPGMA) in A. Pre-Processing accuracy for most number of clusters. The document set Ds is converted from The Apriori algorithm [9] is a well-known method unstructured format into some common representation for computing frequent item sets in a transaction using the text preprocessing techniques, in which database. The document under the same topic, shares words or terms are extracted (tokenization). The input more common frequent item sets (terms) than the data set of documents in Ds are preprocessed using the documents of different topics. The main advantage of techniques namely, removing HTML tags first, after using frequent item sets is that it can identify the that apply stop words list and stemming algorithm. relation among the more than two documents at a time a) HTML Tags: parsing of HTML Tag in a document collection unlike similarity measure b) Stop words: Remove the stop words list like between two documents [10, 11].By the means of “conjunctions, connectives, prepositions etc” maximal frequent item sets, the dimensionality of the c) Stemming algorithm: We utilize porter 2 document set is reduced. More over maximal frequent stemmer algorithm in our approach. item sets captures most related document sets. On the other hand, hierarchical clustering most relevant for B. Vector representation of document: browsing and maps most specific documents to generalized documents in the whole collection. Vector space model is the most commonly used document representation model in text mining, web A conventional hierarchical clustering method mining and information retrieval areas. In this model constructs the hierarchy by subdividing parent cluster each document is represented as n-dimensional term or merging similar children clusters. It usually suffers vector. The value of each term in the n-dimensional from its inability to perform tuning once a merge or vector reflects the importance of corresponding split decision has been performed. This rigidity may document. Let N be the total number of terms and M lower the clustering accuracy. Furthermore, due to the be the number of documents and each the document 𝐷 𝑖 = (𝑡𝑒𝑟𝑚 𝑖1 , 𝑡𝑒𝑟𝑚 𝑖2 , … … … … . . 𝑡𝑒𝑟𝑚 𝑖𝑛 ) 1≤ i≤ M. Where fact that a parent cluster in the hierarchy always can be denoted as 𝑑𝑓(𝑡𝑒𝑟𝑚 𝑖𝑗 ) < 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 contains all objects of its Childs, this kind of hierarchy frequency 𝑡𝑒𝑟𝑚 𝑖𝑗 is less than the threshold value is is not suitable for browsing. The user may have value. The document difficulty to locate his intention object in such a large cluster. considered to avoid the problem of more times a term Our hierarchical clustering method is completely appears throughout all documents in the whole different. The aim of this paper is, first we form all collection, the more poorly it discriminates between the clusters by assigning documents to the most similar documents [12].Calculate term frequency tf is number cluster using maximal frequent item sets by Apriori of times a term appears in a document. Document frequency of a term df as no of documents that documents vectors. 𝐷 𝑖 = (𝑤 𝑖1 , 𝑤12 , 𝑤13 , … … . . , 𝑤1 𝑖𝑛 ) algorithm and then construct the hierarchical document clustering based on their inter-cluster contains term. Also construct the weights for Where 𝑤 𝑖𝑗 = 𝑡𝑓𝑖𝑗 ∗ 𝐼𝐷𝑓(𝑗) and similarities via same maximal frequent item set (MFI) www.ijorcs.org
  • 3. IDf (j) =𝑙𝑜𝑔 � �1≤j≤n.where IDf is the inverse 𝑚 Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering 9 𝑑𝑓 𝑗 A frequent item set is a set of words which occurs frequently together and are good candidates for such that X ⊂ X1 and t(X) = t(X1), where t(X) defined document frequency. clusters and are denoted by FI. An item set X is closed Table 1: Table Representation of Transactional Database of if there does not exist an item set X1 such that X1, Documents as the set of transactions that contain item set X and it Terms Doc 1 Doc 2 Doc 3 ..... Doc 4 is denoted by FCI(frequently closed items).If X is Java 1 1 0 ..... 1 frequent and no superset of X is frequent among the Beans 0 1 0 ..... 0 MFI. Then MFI⊂ FCI ⊂ FI Whenever there are very set of items I in transactional databases. Then we say ..... ..... ….. ….. ..... ….. that X is maximal frequent item set and denoted by Servlets 1 0 1 ..... 1 By the representation of document as vector form, long patterns are present in the data it is often we can easily identify which documents Contains the impractical to generate the entire set if frequent item same features .The more features documents have in sets or closed item sets [16]. In that case, maximal common, the more related they are. Thus, it is realistic frequent item sets are adequate for such applications. to find well related documents. Assume that each We employed maximal frequent item set algorithm document is an item in the transactional database; each from [17] using apriori. These maximal frequent item term corresponds to a transaction. Our aim is to search sets are initial seeds for hierarchical document for highly related documents “appearing” together clustering. with same features (the documents whose MFI features D. Pseudo code Algorithm are closed). Similarly, the maximal frequent item set discovery in the transaction database serves the For MFI Based Similarity Measure for Hierarchical purpose of finding items of documents appearing Document Clustering together in many transactions. i.e., document sets Input: Document set Ds. which have large amount of feature in common. Definition: MFI: Maximal Frequent Item set. C. Apriori for maximal frequent item sets (tf) Term frequency and (df) document frequency Mining frequent item sets is a primary content of Step 1. For each document in Ds, Remove the HTML data mining that emphasizes particularly in finding the relation of different items in the large database. Mining tags and perform stop word list and stemming. Step 2. Calculate the term frequency (tf) and document 𝐷 𝑖 = (𝑡𝑒𝑟𝑚 𝑖1 , 𝑡𝑒𝑟𝑚 𝑖2 , … … … … . . 𝑡𝑒𝑟𝑚 𝑖𝑛 ) 1≤i≤M frequent patterns is crucial problem in many data mining applications such as the discovery of frequency (df). Where df�𝑡𝑒𝑟𝑚 𝑖𝑗 � < Threshold value association rules, correlations, multidimensional patterns, and other numerous important inferring patterns from consumer market basket analysis and web access etc. The association mining problem is Step 3. Also construct the weighted document vectors 𝐷 𝑖 = (𝑤 𝑖1 , 𝑤12 , 𝑤13 , … … . . , 𝑤1 𝑖𝑛 ) 𝑤 𝑖𝑗 = 𝑡𝑓𝑖𝑗 ∗ formulated as follows: Given a large data base of set of for all the documents 𝐼𝐷𝑓(𝑗).Idf (j) =𝑙𝑜𝑔 � � 1≤j≤n. items transactions, find all frequent item sets, where a 𝑚 Where frequent item set is one that occurs in at least a user- 𝑑𝑓 𝑗 specified threshold value of the data base. Many of the proposed item set mining algorithms are a variant of Step 4. Now represent each documents by keywords Apriori, which employs a bottom-up, breadth first whose tf>support search that enumerates every single frequent item set. 𝑀𝐹𝐼 = { 𝐹1 , 𝐹2 , 𝐹3 , … … … … . . 𝐹 𝑛 } Apriori is a conventional algorithm that was first Calculate the Maximal Frequent Item set(MFI) of introduced] for mining association rules. Association terms using Apriori algorithm Where each 𝐹𝑖 = { 𝑑1 , 𝑑2 , 𝑑3 , … … … 𝑑 𝑘 } can be viewed as two-step process as a document 𝑑 𝑖 is in more than one maximal frequent item set then choose 𝐼 𝑑 as a set (1) Identifying all frequent item sets Step 5. If (2) Generating strong association rules from the containing document 𝑑 𝑖 . Then Assign𝐼 𝑥 =𝐼 𝑑0 .For frequent item sets consisting of such maximal frequent item sets At first, candidate item sets are generated and the document 𝑑 𝑖 afterwards frequent item sets are mined with the help each the maximal frequent item sets containing 𝐼𝑓 [ 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑥 , 𝑑 𝑖 )) of these candidate item sets. In the proposed approach, > 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑑𝑖 , 𝑑 𝑖 ))] we have used only the frequent item sets for further processing so that, we undergone only the first step (generation of maximal frequent item sets) of the Apriori algorithm. www.ijorcs.org
  • 4. Then assign 𝐼 𝑥 = 𝐼 𝑑𝑖 .Assign the document 𝑑 𝑖 to 𝐼 𝑥 𝐹𝑖 𝑙𝑖𝑘𝑒 𝐹3 = { 𝑑1 , 𝑑5 , 𝑑7 } as one cluster in hierarchy 10 P. Rajesh, G. Narasimha, N.Saisumanth and discard 𝑑 𝑖 for other maximal frequent item sets. Case 3: If 𝐹𝑖 , 𝐹𝑗 contains some same documents and represent it by center (as in step6). Repeat this process for all documents that occurs in consider the case of document 𝑑2 is repeatedin more more than one maximal frequent item set these maximal frequent item sets 𝐹𝑖 as clusters than one maximal frequent item sets{𝐹1 𝐹4 }.Similarly among the documents list obtained from MFI. Let us and combine the documents in 𝐹𝑖 into a single Step 6. Apply hierarchical document clustering to make 𝑑4 is repeated in{ 𝐹1 , 𝐹2 , 𝐹4 }. Then choose𝐼 𝑑 = { 𝐹1 , 𝐹2 , 𝐹4 } = { 𝐼 𝑑0 , 𝐼 𝑑1 , 𝐼 𝑑2 }for document𝑑4 .Assign 𝐼 𝑥 =𝐼 𝑑0 = 𝐹1 . For each the maximal frequent item sets new document and represent it by centers of the 𝐼 𝑑 containing 𝑑4 maximal frequent item sets. These are obtained 𝐼 𝑑0 𝑡𝑜 𝐼 𝑑2 calculate the measure by combining the features of maximal frequent in the document from 𝐼𝑓 [ 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑥 , 𝑑4 )) item set of terms that grouping the documents > 𝑗𝑎𝑐𝑐𝑎𝑟𝑑𝑠(𝑐𝑒𝑛𝑡𝑒𝑟 ( 𝐼 𝑑𝑖 , 𝑑4 ))] Step 7. Repeat the same process of hierarchical document clustering based on maximal frequent document 𝑑4 closest to which maximal frequent item item sets for all levels in hierarchy and stop if total number of documents equals to one else go By using this jaccards measure, we can identify the document 𝑑4 .Then assign 𝐼 𝑥 = 𝐼 𝑑𝑖 . to step 4. set among maximal frequent item sets containing the Let’s suppose that 𝑑4 is closed to the maximal IV. HIERARCHICAL CLUSTERS BASED ON frequent item set 𝐹4 . Assign the document𝑑4 to𝐼 𝑥 = MAXIMAL FREQUENT ITEM SETS 𝐼 𝑑𝑖 = 𝐹4 and discard 𝑑4 for other maximal frequent After finding maximal frequent item sets (MFI) by using Apriori algorithm. We turn to describing the exactly one cluster. Similarly 𝑑2 belongs to𝐹1 .Repeat creation of hierarchical document clustering using item sets. After this step, each document belongs to same similarity measure by MFI. A simple instance among the whole collection of documents 𝐷 𝑆 by case of example is also provided to demonstrate the 𝑑2 , 𝑑4 are repeated in𝐹1 , 𝐹4 . The clusters that will form this process for all documents that occurs in more than apriorialgorithm are 𝑀𝐹𝐼 = { 𝐹1 , 𝐹2 , 𝐹3 … . . 𝐹 𝑛 }.Where entire process. The set of maximal frequent item sets one maximal frequent item set. Since the documents by𝐹𝑖 = { 𝑑1 , 𝑑2 , 𝑑3 … . . 𝑑 𝑘 }.Then consider total number at the first level of hierarchy by applying step5 and 𝐹1 = {𝑑2 , 𝑑6 } each MFI consist of set of documents represented step 6 are as follows. 𝐹2 = {𝑑3 , , 𝑑8 } of documents which occurs in maximal frequent item 𝑑1 , 𝑑2 , 𝑑3, 𝑑4 , 𝑑5 , 𝑑6 , 𝑑7 , 𝑑8 , 𝐹3 = { 𝑑1 , 𝑑5 , 𝑑7 } sets in MFI as follows. 𝑀𝐹𝐼 = � � 𝑑9 , 𝑑10 , 𝑑11 , 𝑑12 , 𝑑13 , 𝑑14 , 𝑑15 𝐹4 = {𝑑4 , , 𝑑14 } 𝐹1 = {𝑑2 , 𝑑4 , 𝑑6 } 𝐹5 = {𝑑10 , 𝑑12 , 𝑑15 } 𝐹2 = {𝑑3 , 𝑑4 , 𝑑8 } 𝐹6 = {𝑑9 , 𝑑11 , 𝑑13 } 𝐹3 = { 𝑑1 , 𝑑5 , 𝑑7 } 𝐹4 = {𝑑4 , 𝑑2 , 𝑑14 } The hierarchical diagram for the above form of 𝐹5 = {𝑑10 , 𝑑12 , 𝑑15 } maximal frequent item set clusters can be representing 𝐹6 = {𝑑9 , 𝑑11 , 𝑑13 } as follows. Repeat the same process of hierarchical document clustering based on maximal frequent item sets for all levels in hierarchy and stop if total number The clusters in the resulting hierarchy are non- of documents equals to one else go to step 4. overlapping. This can be achieved through the Case1: If 𝐹𝑖 , 𝐹𝑗 are same then choose one in random following cases. Case2: If 𝐹𝑖 , 𝐹𝑗 are different then form clusters of to form cluster. documents contained in𝐹𝑖 , 𝐹𝑗 independently. In our in 𝐹3 , 𝐹5 and 𝐹6 𝑎𝑟𝑒 different. So we form a clusters example, the maximal frequent item set of documents according to the documents contained in Figure 1: Hierarchical document clustering using MFI www.ijorcs.org
  • 5. Represent each new document �𝐿 𝑖𝑗 � in hierarchy by Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering 11 itself. When we are classifying the documents into maximal frequent item set of terms as centers (as in equivalence classes, we are not considering these ones step 6).These maximal frequent item sets are obtained and put zeros. Jaccard similarity coefficient matrix for by combining the features of maximal frequent item four documents can be represented as follows. set of terms that grouping the documents. Each new d1 d2 d3 d4 �𝐿 𝑖𝑗 � represents that jth document in the level of document also consisting of corresponding updated weights of maximal frequent item set of terms. Where d 1  1 0.4 0.8 0.5 hierarchy𝐿 𝑖 . In the figure { 𝐿12 = 𝐿21 }means that the d 2 0.4 1 0.8 0.4 Rα =   level 𝐿1 are not matched with other documents MFI set d 3 0.8 0.8 1 0.9 maximal frequent item set of terms in 2nd document of   d 4 0.5 0.4 0.9 1  in same level𝐿1 .So it is repeated same for the next level and it is also same for the document { 𝐿13 = Ds = {d1 , d2 , d3 , d4 }as the collectionof document pairs 𝐿22 }. The documents{ 𝐿11 , 𝐿15 } and{ 𝐿14 , 𝐿16 } in first Where alpha is threshold. Let define a relation R on value. i.e 𝑅 = {(𝑑 𝑖 , 𝑑 𝑗 )/ 𝐽 (𝑑 𝑖 , 𝑑 𝑗 ) ≥ 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 } whose similarity measure is above some threshold level as 𝐿23 , 𝐿24 . level are combined using MFI based hierarchical 1. R is reflexive on Ds iff 𝑅 (𝑑 𝑖 , 𝑑 𝑖 ) = 1. i.e Every clustering and represent these documents in the second 2. R is symmetric on Ds iff𝑅 �𝑑 𝑖 , 𝑑 𝑗 � = 𝑅 �𝑑 𝑗 , 𝑑 𝑖 �i.e document is mostly related to itself. if the document 𝑑 𝑖 is similar to 𝑑 𝑗 then the V. PRIVACY PRESERVING OF WEB document 𝑑 𝑗 is also similar to𝑑 𝑖 . DOCUMENTS USING EQUIVALENCE RELATION Most internet web documents are publicly available 𝑅 (𝑑 𝑖 , 𝑑 𝑘 ) ≥ 𝑚𝑎𝑥 𝑗 { min{𝑅 �𝑑 𝑖 , 𝑑 𝑗 �, 𝑅 �𝑑 𝑗 , 𝑑 𝑖 �}}. for providing services required by the user. In such 3. R is transitive on Ds iff documents there is no confidential or sensitive data (open to all). Then how can we provide privacy of such documents. Now a days, same information will Then R is transitive by the definition. be exists in more than one document in duplicate Then R is an equivalence relation on Ds, which forms. The way of providing privacy preserving of partitions the input document set Ds into set of documents is by avoiding duplicate documents. There equivalence classes. Equivalence relation seems a by we can protect the privacy of individual copy rights natural technique for duplicate document of documents. Many duplicate document detection categorization. Any two documents in same techniques are available such as syntactic, URL based, equivalence class are related and are different if they semantic approaches. In each technique, a processing are coming from two equivalence classes. The set of overhead of maintaining shingling’s, signatures, all equivalence classes induces the document set Ds. fingerprints [13, 14, 15, 18]. In this paper, we High syntactic similarity pairs of documents typically proposed a new technique for avoiding duplicate referred to as duplicates or near duplicates except documents using equivalence relation. Let Ds be the diagonal elements. By using equivalence relation, input duplicate document set is subset to web easily we can identify the duplicate documents or we document collection. First find the jaccard similarity can perform the clustering on duplicate documents. measure for every pair of documents in Ds using Apart from the representation of feature document weighted feature representation of maximal frequent vector by MFI, we also need to consider that who is item sets discussed in step 2 and step 3 in algorithm. If the author of document, when the document was the similarity measure of two documents is equal to 1, created, where it is available, helps in effectively then the two documents are most similar. If the finding the duplicate documents. Each document in measure is 0, then they are not duplicates. The Jaccard input Ds must belong to unique equivalence class. If R index or the Jaccard similarity coefficient is a is equivalence relation on Ds = {d1, d2, d3, d4 …..dn}. statistical measure of similarity between sample sets. Then number of equivalence relations on Ds is always For two sets, it is denoted as the cardinality of their lies between n ≤ | R|≤ n2. i.e the time complexity of intersection divided by the cardinality of their union. |𝑑1 ∩ 𝑑2 | calculating equivalence relation on Ds is O(n2). .i.e𝐽 �𝑑 𝑖 , 𝑑 𝑗 � ≥ 0.8. Since the matrix is symmetric, the Mathematically 𝐽(𝑑1 , 𝑑2 ) = Choose the threshold α in equivalence relation as 0.8 |𝑑1 ∩ 𝑑2 | documents sets {(𝑑3 , 𝑑1 ), (𝑑3 , 𝑑2 ), (𝑑4 , 𝑑3 )} are mostly related. Hence the documents are near For every pair of two documents calculate jaccard duplicates and grouping the documents into clusters measure of d1, d2.All the diagonal elements in matrix thereby providing privacy of individual copy rights of are ones, because every document mostly related to documents. www.ijorcs.org
  • 6. 12 P. Rajesh, G. Narasimha, N.Saisumanth 0 0 1 0 Data mining 2002 (KDD-2002), Edmonton, Alberta, 0 0 1 0 Canada. R 0.8 =   [8] BenjaminFung, C.M., Wang, Ke., Ester, Martin. (2003). 1 1 0 1 “Hierarchical Document Clustering using Frequent Item   Sets”. In Proceedings SIAM International Conference 0 0 1 0 on Data Mining 2003 (SIAM DM-2003), pp:59-70. [9] Agrawal, R., Srikant, R. (1994). “Fast Algorithms for VI. CONCLUSION AND FUTURE SCOPE Mining Association Rules”. In the Proceedings of 20th International Conference on Very Large Data Bases, Cluster analysis can be used as powerful ,stranded 1994, Santiago, Chile, PP: 487-499. alone data mining concept that gains insight [10] Liu, W.L., and Zeng, X.S. (2005). “Document information of knowledge from huge unstructured Clustering Based on Frequent Term Sets”. Proceedings databases. Most conventional clustering methods do of Intelligent Systems and Control, 2005. not satisfy the document clustering requirements such [11] Zamir, O., Etzioni, O. (1998). “Web Document as high dimensionality, huge volumes and easy of Clustering: A Feasibility Demonstration”. In the accessing meaningful clusters labels. In this paper, we Proceedings of ACM,1998 (SIGIR-98), PP: 46-54. presented novel approach; Maximal frequent item set [12] Kjersti, (1997). “A Survey on Personalized Information (MFI) Based Similarity Measure for Hierarchical Filtering Systems for the World Wide Web”. Technical Document Clustering to address these issues. Report 922, Norwegian Computing Center, 1997. Dimensionality reduction can be achieved through [13] Prasannakumar, J., Govindarajulu, P., “Duplicate and MFI. By using the same MFI similarity measure in Near Duplicate Documents Detection: A Review”. hierarchal document clustering, the number of levels European Journal of Scientific Research ISSN 1450- will be decreased. It is easy for browsing. Clustering 216X Vol.32 No.4 ,2009, pp:514-527 has its paths in many areas, by applying MFI based [14] Syed Mudhasir,Y., Deepika,J., “Near Duplicate techniques to clusters, including data mining, statistics, Detection and Elimination Based on Web Provenance biology, and machine learning we can get the high for Efficient Web Search”. In the Proceedings of quality of clusters. Moreover, by means of maximal International Journal on Internet and Distributed frequent item sets, we can predict the most influenced Computing Systems, Vol.1, No.1, 2011. objects of clusters in the entire dataset of applications [15] Alsulami, B.S., Abulkhair, F., Essa, E., “Near Duplicate like business, marketing, world wide web, social Document Detection Survey”. In the Proceedings of networking analysis. International Journal of Computer Science and Communications Networks, Vol.2, N0.2, pp:147-151. VII. REFEERENCES [16] Doug Burdick, Manuel Calimlim, Johannes Gehrke. (2001). “A Maximal Frequent Itemset Algorithm for [1] Ruxixu, Donald Wunsch., “A Survey of Clustering Transactional Databases”. In the Proceedings of ICDE, Algorithms”. In the Proceedings of IEEE Transactions 17th International Conference on Data Engineering on Neural Networks, Vol. 16, No. 3, May 2005. 2001 (ICDE-2001). [2] Jain, A.K., Murty, M.N., Flynn, P.J., “Data Clustering: [17] Murali Krishna, S., Durga Bhavani, S., “An Efficient A Review”. In the Proceedings of ACM Computing Approach for Text Clustering Based On Frequent Item Surveys, Vol.31, No.3, 1999, pp: 264-323. Sets”. European Journal of Scientific Research ISSN [3] Kleinberg, J.M., “Authoritative Sources in a 1450-216X, Vol.42, No.3, 2010, pp:399-410. Hyperlinked Environment”. In the Journal of the ACM, [18] Lopresti, D.P. (1999). "Models and Algorithms for Vol. 46, No.5, 1999, pp: 604-632. Duplicate Document Detection". In the Proceedings of [4] Ling Zhuang, Honghua Dai. (2004). “A Maximal Fifth International Conference on Document Analysis Frequent Item Set Approach for Web Document and Recognition 1999 (ICDAR-1999), 20th-22th Sep, Clustering”. In Proceedings of the IEEE Fourth pp:297-300. International Conference on Computer and Information Technology 2004 (CIT-2004). [5] Michael, W., Trosset. (2008). “Representing Clusters: k-Means Clustering, Self-Organizing Maps and Multidimensional Scaling”. Technical Report, Department of Statistics, Indian University, Bloomington, 2008. [6] Michael Steinbach, George karypis, and Vipinkumar. (2000). “A Comparison of Document Clustering Techniques”. In Proceedings of the Workshop on Text Mining, 2000 (KDD-2000), Boston, pp: 109-111. [7] Beil, F., Ester, M., Xu, X. (2002). “Frequent Term- Based Text Clustering”. In Proceedings of 8th International Conference on Knowledge Discovery and www.ijorcs.org