Fast Algorithms for Mining Association Rules

Rakesh Agrawal        Ramakrishnan Srikant

IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120

* Visiting from the Department of Computer Science, University of Wisconsin, Madison.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994

Abstract

We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database.

1 Introduction

Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data. A record in such data typically consists of the transaction date and the items bought in the transaction. Successful organizations view such databases as important pieces of the marketing infrastructure. They are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies [6].

The problem of mining association rules over basket data was introduced in [4]. An example of such a rule might be that 98% of customers that purchase tires and auto accessories also get automotive services done. Finding all such rules is valuable for cross-marketing and attached mailing applications. Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns. The databases involved in these applications are very large. It is imperative, therefore, to have fast algorithms for this task.

The following is a formal statement of the problem [4]: Let I = {i_1, i_2, ..., i_m} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its TID. We say that a transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y. Our rules are somewhat more general than those in [4] in that we allow a consequent to have more than one item.

Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf) respectively. Our discussion is neutral with respect to the representation of D. For example, D could be a data file, a relational table, or the result of a relational expression.

An algorithm for finding all association rules, henceforth referred to as the AIS algorithm, was presented in [4]. Another algorithm for this task, called the SETM algorithm, has been proposed in [13]. In this paper, we present two new algorithms, Apriori and AprioriTid, that differ fundamentally from these algorithms. We present experimental results showing
that the proposed algorithms always outperform the earlier algorithms. The performance gap is shown to increase with problem size, and ranges from a factor of three for small problems to more than an order of magnitude for large problems. We then discuss how the best features of Apriori and AprioriTid can be combined into a hybrid algorithm, called AprioriHybrid. Experiments show that AprioriHybrid has excellent scale-up properties, opening up the feasibility of mining association rules over very large databases.

The problem of finding association rules falls within the purview of database mining [3] [12], also called knowledge discovery in databases [21]. Related, but not directly applicable, work includes the induction of classification rules [8] [11] [22], discovery of causal rules [19], learning of logical definitions [18], fitting of functions to data [15], and clustering [9] [10]. The closest work in the machine learning literature is the KID3 algorithm presented in [20]. If used for finding all association rules, this algorithm will make as many passes over the data as the number of combinations of items in the antecedent, which is exponentially large. Related work in the database literature is the work on inferring functional dependencies from data [16]. Functional dependencies are rules requiring strict satisfaction. Consequently, having determined a dependency X → A, the algorithms in [16] consider any other dependency of the form X + Y → A redundant and do not generate it. The association rules we consider are probabilistic in nature. The presence of a rule X → A does not necessarily mean that X + Y → A also holds, because the latter may not have minimum support. Similarly, the presence of rules X → Y and Y → Z does not necessarily mean that X → Z holds, because the latter may not have minimum confidence.

There has been work on quantifying the "usefulness" or "interestingness" of a rule [20]. What is useful or interesting is often application-dependent. The need for a human in the loop and for tools to allow human guidance of the rule discovery process has been articulated, for example, in [7] [14]. We do not discuss these issues in this paper, except to point out that these are necessary features of a rule discovery system that may use our algorithms as the engine of the discovery process.

1.1 Problem Decomposition and Paper Organization

The problem of discovering all association rules can be decomposed into two subproblems [4]:

1. Find all sets of items (itemsets) that have transaction support above minimum support. The support for an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large itemsets, and all others small itemsets. In Section 2, we give new algorithms, Apriori and AprioriTid, for solving this problem.

2. Use the large itemsets to generate the desired rules. Here is a straightforward algorithm for this task: for every large itemset l, find all non-empty subsets of l. For every such subset a, output a rule of the form a ⇒ (l − a) if the ratio of support(l) to support(a) is at least minconf. We need to consider all subsets of l to generate rules with multiple consequents. Due to lack of space, we do not discuss this subproblem further, but refer the reader to [5] for a fast algorithm.

In Section 3, we show the relative performance of the proposed Apriori and AprioriTid algorithms against the AIS [4] and SETM [13] algorithms. To make the paper self-contained, we include an overview of the AIS and SETM algorithms in this section. We also describe how the Apriori and AprioriTid algorithms can be combined into a hybrid algorithm, AprioriHybrid, and demonstrate the scale-up properties of this algorithm. We conclude by pointing out some related open problems in Section 4.

2 Discovering Large Itemsets

Algorithms for discovering large itemsets make multiple passes over the data. In the first pass, we count the support of individual items and determine which of them are large, i.e. have minimum support. In each subsequent pass, we start with a seed set of itemsets found to be large in the previous pass. We use this seed set for generating new potentially large itemsets, called candidate itemsets, and count the actual support for these candidate itemsets during the pass over the data. At the end of the pass, we determine which of the candidate itemsets are actually large, and they become the seed for the next pass. This process continues until no new large itemsets are found.

The Apriori and AprioriTid algorithms we propose differ fundamentally from the AIS [4] and SETM [13] algorithms in terms of which candidate itemsets are counted in a pass and in the way that those candidates are generated. In both the AIS and SETM algorithms, candidate itemsets are generated on-the-fly during the pass as data is being read. Specifically, after reading a transaction, it is determined which of the itemsets found large in the previous pass are present in the transaction. New candidate itemsets are generated by extending these large itemsets with other items in the transaction. However, as we will see, the disadvantage is that this results in unnecessarily generating and counting too many candidate itemsets that turn out to be small.
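As an aside, the straightforward rule-generation procedure of Section 1.1 (subproblem 2) can be sketched in a few lines of Python. This is only an illustration, not the fast algorithm of [5]; the function and variable names are ours, itemsets are represented as frozensets, and supports are absolute counts:

```python
from itertools import combinations

def gen_rules(large_itemsets, minconf):
    """Output every rule a => (l - a) with confidence >= minconf.
    large_itemsets maps each large itemset (a frozenset) to its
    support count.  Because every non-empty proper subset of l is
    tried as an antecedent, rules may have multiple consequents."""
    rules = []
    for l, sup_l in large_itemsets.items():
        # every non-empty proper subset a of l is a candidate antecedent;
        # subsets of a large itemset are large, so their supports are known
        for r in range(1, len(l)):
            for a in (frozenset(s) for s in combinations(sorted(l), r)):
                conf = sup_l / large_itemsets[a]
                if conf >= minconf:
                    rules.append((a, l - a, conf))
    return rules
```

For instance, on the large itemsets of the worked example in Section 2.2, with minconf = 1.0, this produces rules such as {2 3} ⇒ {5} and {3 5} ⇒ {2}, whose confidence is support({2 3 5})/support({2 3}) = 2/2 and support({2 3 5})/support({3 5}) = 2/2 respectively.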
The Apriori and AprioriTid algorithms generate the candidate itemsets to be counted in a pass by using only the itemsets found large in the previous pass, without considering the transactions in the database. The basic intuition is that any subset of a large itemset must be large. Therefore, the candidate itemsets having k items can be generated by joining large itemsets having k − 1 items, and deleting those that contain any subset that is not large. This procedure results in the generation of a much smaller number of candidate itemsets.

The AprioriTid algorithm has the additional property that the database is not used at all for counting the support of candidate itemsets after the first pass. Rather, an encoding of the candidate itemsets used in the previous pass is employed for this purpose. In later passes, the size of this encoding can become much smaller than the database, thus saving much reading effort. We will explain these points in more detail when we describe the algorithms.

Notation: We assume that items in each transaction are kept sorted in their lexicographic order. It is straightforward to adapt these algorithms to the case where the database D is kept normalized and each database record is a <TID, item> pair, where TID is the identifier of the corresponding transaction.

We call the number of items in an itemset its size, and call an itemset of size k a k-itemset. Items within an itemset are kept in lexicographic order. We use the notation c[1] · c[2] · ... · c[k] to represent a k-itemset c consisting of the items c[1], c[2], ..., c[k], where c[1] < c[2] < ... < c[k]. If c = X · Y and Y is an m-itemset, we also call Y an m-extension of X. Associated with each itemset is a count field to store the support for this itemset. The count field is initialized to zero when the itemset is first created.

We summarize in Table 1 the notation used in the algorithms. The set C̄_k is used by AprioriTid and will be further discussed when we describe this algorithm.

Table 1: Notation

  k-itemset : An itemset having k items.
  L_k : Set of large k-itemsets (those with minimum support).
        Each member of this set has two fields: i) itemset and ii) support count.
  C_k : Set of candidate k-itemsets (potentially large itemsets).
        Each member of this set has two fields: i) itemset and ii) support count.
  C̄_k : Set of candidate k-itemsets when the TIDs of the generating
        transactions are kept associated with the candidates.

2.1 Algorithm Apriori

Figure 1 gives the Apriori algorithm. The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the large itemsets L_{k-1} found in the (k−1)th pass are used to generate the candidate itemsets C_k, using the apriori-gen function described in Section 2.1.1. Next, the database is scanned and the support of candidates in C_k is counted. For fast counting, we need to efficiently determine the candidates in C_k that are contained in a given transaction t. Section 2.1.2 describes the subset function used for this purpose. See [5] for a discussion of buffer management.

1)  L_1 = {large 1-itemsets};
2)  for ( k = 2; L_{k-1} ≠ ∅; k++ ) do begin
3)     C_k = apriori-gen(L_{k-1});  // New candidates
4)     forall transactions t ∈ D do begin
5)        C_t = subset(C_k, t);  // Candidates contained in t
6)        forall candidates c ∈ C_t do
7)           c.count++;
8)     end
9)     L_k = {c ∈ C_k | c.count ≥ minsup}
10) end
11) Answer = ∪_k L_k;

Figure 1: Algorithm Apriori

2.1.1 Apriori Candidate Generation

The apriori-gen function takes as argument L_{k-1}, the set of all large (k−1)-itemsets. It returns a superset of the set of all large k-itemsets. The function works as follows.¹ First, in the join step, we join L_{k-1} with L_{k-1}:

insert into C_k
select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1};

Next, in the prune step, we delete all itemsets c ∈ C_k such that some (k−1)-subset of c is not in L_{k-1}:

forall itemsets c ∈ C_k do
   forall (k−1)-subsets s of c do
      if ( s ∉ L_{k-1} ) then
         delete c from C_k;

¹ Concurrent to our work, the following two-step candidate generation procedure has been proposed in [17]:
    C'_k = { X ∪ X' | X, X' ∈ L_{k-1}, |X ∩ X'| = k − 2 }
    C_k = { X ∈ C'_k | X contains k members of L_{k-1} }
These two steps are similar to our join and prune steps respectively. However, in general, step 1 would produce a superset of the candidates produced by our join step.
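As an illustration, apriori-gen and the main loop of Figure 1 can be sketched in Python. This is a sketch under simplifying assumptions, not the paper's implementation: itemsets are frozensets, minsup is an absolute count, and the candidates contained in a transaction are found by a direct subset test instead of the hash-tree of Section 2.1.2:

```python
from collections import defaultdict
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join step: combine pairs of large (k-1)-itemsets that agree on
    their first k-2 items; prune step: drop every candidate that has
    a (k-1)-subset not in L_prev."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] != q[:-1]:
                break  # sorted order: no later q shares p's prefix
            candidates.add(frozenset(p + (q[-1],)))
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

def apriori(transactions, minsup):
    """Figure 1: return every large itemset with its support count."""
    counts = defaultdict(int)
    for t in transactions:          # first pass: count item occurrences
        for item in t:
            counts[frozenset([item])] += 1
    L = {c for c, n in counts.items() if n >= minsup}
    answer = {c: counts[c] for c in L}
    k = 2
    while L:
        Ck = apriori_gen(L, k)      # new candidates
        counts = defaultdict(int)
        for t in transactions:
            for c in Ck:            # subset test in place of the hash-tree
                if c <= t:
                    counts[c] += 1
        L = {c for c, n in counts.items() if n >= minsup}
        answer.update({c: counts[c] for c in L})
        k += 1
    return answer
```

On L_3 = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}, apriori_gen joins to {{1 2 3 4}, {1 3 4 5}} and prunes {1 3 4 5}, leaving only {1 2 3 4}, matching the example that follows.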
Example: Let L_3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}. After the join step, C_4 will be {{1 2 3 4}, {1 3 4 5}}. The prune step will delete the itemset {1 3 4 5} because the itemset {1 4 5} is not in L_3. We will then be left with only {1 2 3 4} in C_4.

Contrast this candidate generation with the one used in the AIS and SETM algorithms. In pass k of these algorithms, a database transaction t is read and it is determined which of the large itemsets in L_{k-1} are present in t. Each of these large itemsets l is then extended with all those large items that are present in t and occur later in the lexicographic ordering than any of the items in l. Continuing with the previous example, consider a transaction {1 2 3 4 5}. In the fourth pass, AIS and SETM will generate two candidates, {1 2 3 4} and {1 2 3 5}, by extending the large itemset {1 2 3}. Similarly, an additional three candidate itemsets will be generated by extending the other large itemsets in L_3, leading to a total of 5 candidates for consideration in the fourth pass. Apriori, on the other hand, generates and counts only one itemset, {1 2 3 4}, because it concludes a priori that the other combinations cannot possibly have minimum support.

Correctness: We need to show that C_k ⊇ L_k. Clearly, any subset of a large itemset must also have minimum support. Hence, if we extended each itemset in L_{k-1} with all possible items and then deleted all those whose (k−1)-subsets were not in L_{k-1}, we would be left with a superset of the itemsets in L_k.

The join is equivalent to extending L_{k-1} with each item in the database and then deleting those itemsets for which the (k−1)-itemset obtained by deleting the (k−1)th item is not in L_{k-1}. The condition p.item_{k-1} < q.item_{k-1} simply ensures that no duplicates are generated. Thus, after the join step, C_k ⊇ L_k. By similar reasoning, the prune step, where we delete from C_k all itemsets whose (k−1)-subsets are not in L_{k-1}, also does not delete any itemset that could be in L_k.

Variation: Counting Candidates of Multiple Sizes in One Pass: Rather than counting only candidates of size k in the kth pass, we can also count the candidates C'_{k+1}, where C'_{k+1} is generated from C_k, etc. Note that C'_{k+1} ⊇ C_{k+1}, since C_{k+1} is generated from L_k. This variation can pay off in the later passes, when the cost of counting and keeping in memory the additional C'_{k+1} − C_{k+1} candidates becomes less than the cost of scanning the database.

2.1.2 Subset Function

Candidate itemsets C_k are stored in a hash-tree. A node of the hash-tree either contains a list of itemsets (a leaf node) or a hash table (an interior node). In an interior node, each bucket of the hash table points to another node. The root of the hash-tree is defined to be at depth 1. An interior node at depth d points to nodes at depth d+1. Itemsets are stored in the leaves. When we add an itemset c, we start from the root and go down the tree until we reach a leaf. At an interior node at depth d, we decide which branch to follow by applying a hash function to the dth item of the itemset. All nodes are initially created as leaf nodes. When the number of itemsets in a leaf node exceeds a specified threshold, the leaf node is converted to an interior node.

Starting from the root node, the subset function finds all the candidates contained in a transaction t as follows. If we are at a leaf, we find which of the itemsets in the leaf are contained in t and add references to them to the answer set. If we are at an interior node and we have reached it by hashing the item i, we hash on each item that comes after i in t and recursively apply this procedure to the node in the corresponding bucket. For the root node, we hash on every item in t.

To see why the subset function returns the desired set of references, consider what happens at the root node. For any itemset c contained in transaction t, the first item of c must be in t. At the root, by hashing on every item in t, we ensure that we only ignore itemsets that start with an item not in t. Similar arguments apply at lower depths. The only additional factor is that, since the items in any itemset are ordered, if we reach the current node by hashing the item i, we only need to consider the items in t that occur after i.

2.2 Algorithm AprioriTid

The AprioriTid algorithm, shown in Figure 2, also uses the apriori-gen function (given in Section 2.1.1) to determine the candidate itemsets before the pass begins. The interesting feature of this algorithm is that the database D is not used for counting support after the first pass. Rather, the set C̄_k is used for this purpose. Each member of the set C̄_k is of the form <TID, {X_k}>, where each X_k is a potentially large k-itemset present in the transaction with identifier TID. For k = 1, C̄_1 corresponds to the database D, although conceptually each item i is replaced by the itemset {i}. For k > 1, C̄_k is generated by the algorithm (step 10). The member
of C̄_k corresponding to transaction t is <t.TID, {c ∈ C_k | c contained in t}>. If a transaction does not contain any candidate k-itemset, then C̄_k will not have an entry for this transaction. Thus, the number of entries in C̄_k may be smaller than the number of transactions in the database, especially for large values of k. In addition, for large values of k, each entry may be smaller than the corresponding transaction because very few candidates may be contained in the transaction. However, for small values of k, each entry may be larger than the corresponding transaction because an entry in C̄_k includes all candidate k-itemsets contained in the transaction.

In Section 2.2.1, we give the data structures used to implement the algorithm. See [5] for a proof of correctness and a discussion of buffer management.

1)  L_1 = {large 1-itemsets};
2)  C̄_1 = database D;
3)  for ( k = 2; L_{k-1} ≠ ∅; k++ ) do begin
4)     C_k = apriori-gen(L_{k-1});  // New candidates
5)     C̄_k = ∅;
6)     forall entries t ∈ C̄_{k-1} do begin
7)        // determine candidate itemsets in C_k contained
          // in the transaction with identifier t.TID
          C_t = {c ∈ C_k | (c − c[k]) ∈ t.set-of-itemsets ∧ (c − c[k−1]) ∈ t.set-of-itemsets};
8)        forall candidates c ∈ C_t do
9)           c.count++;
10)       if ( C_t ≠ ∅ ) then C̄_k += <t.TID, C_t>;
11)    end
12)    L_k = {c ∈ C_k | c.count ≥ minsup}
13) end
14) Answer = ∪_k L_k;

Figure 2: Algorithm AprioriTid

Example: Consider the database in Figure 3 and assume that minimum support is 2 transactions. Calling apriori-gen with L_1 at step 4 gives the candidate itemsets C_2. In steps 6 through 10, we count the support of candidates in C_2 by iterating over the entries in C̄_1 and generate C̄_2. The first entry in C̄_1 is { {1}, {3}, {4} }, corresponding to transaction 100. The C_t at step 7 corresponding to this entry t is { {1 3} }, because {1 3} is a member of C_2 and both ({1 3} − {1}) and ({1 3} − {3}) are members of t.set-of-itemsets.

Calling apriori-gen with L_2 gives C_3. Making a pass over the data with C̄_2 and C_3 generates C̄_3. Note that there is no entry in C̄_3 for the transactions with TIDs 100 and 400, since they do not contain any of the itemsets in C_3. The candidate {2 3 5} in C_3 turns out to be large and is the only member of L_3. When we generate C_4 using L_3, it turns out to be empty, and we terminate.

Database                      C̄_1
TID  Items                    TID  Set-of-Itemsets
100  1 3 4                    100  { {1}, {3}, {4} }
200  2 3 5                    200  { {2}, {3}, {5} }
300  1 2 3 5                  300  { {1}, {2}, {3}, {5} }
400  2 5                      400  { {2}, {5} }

L_1                           C_2
Itemset  Support              Itemset  Support
{1}      2                    {1 2}    1
{2}      3                    {1 3}    2
{3}      3                    {1 5}    1
{5}      3                    {2 3}    2
                              {2 5}    3
                              {3 5}    2

C̄_2                                            L_2
TID  Set-of-Itemsets                           Itemset  Support
100  { {1 3} }                                 {1 3}    2
200  { {2 3}, {2 5}, {3 5} }                   {2 3}    2
300  { {1 2}, {1 3}, {1 5},                    {2 5}    3
       {2 3}, {2 5}, {3 5} }                   {3 5}    2
400  { {2 5} }

C_3                           C̄_3
Itemset  Support              TID  Set-of-Itemsets
{2 3 5}  2                    200  { {2 3 5} }
                              300  { {2 3 5} }

L_3
Itemset  Support
{2 3 5}  2

Figure 3: Example

2.2.1 Data Structures

We assign each candidate itemset a unique number, called its ID. Each set of candidate itemsets C_k is kept in an array indexed by the IDs of the itemsets in C_k. A member of C̄_k is now of the form <TID, {ID}>. Each C̄_k is stored in a sequential structure.

The apriori-gen function generates a candidate k-itemset c_k by joining two large (k−1)-itemsets. We maintain two additional fields for each candidate itemset: i) generators and ii) extensions. The generators field of a candidate itemset c_k stores the IDs of the two large (k−1)-itemsets whose join generated c_k. The extensions field of an itemset c_k stores the IDs of all the (k+1)-candidates that are extensions of c_k. Thus, when a candidate c_k is generated by joining l¹_{k-1} and l²_{k-1}, we save the IDs of l¹_{k-1} and l²_{k-1} in the generators field for c_k. At the same time, the ID of c_k is added to the extensions field of l¹_{k-1}.
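Step 7 of Figure 2 can also be written directly from its set-theoretic form. The sketch below is an illustration under our own representation (entries are (TID, set-of-itemsets) pairs over frozensets) and uses the two (k−1)-subset membership tests of step 7 rather than the ID-based generators/extensions fields; the containment test is sound because the two subsets obtained by dropping the last and the next-to-last item together cover every item of the candidate:

```python
from collections import defaultdict

def aprioritid_pass(Ck_bar_prev, Ck):
    """One pass of AprioriTid (steps 6-11 of Figure 2).
    Ck_bar_prev: list of (tid, set of (k-1)-itemsets) entries;
    Ck: candidate k-itemsets as frozensets, k >= 2.
    Returns the support counts and the new encoding C̄_k."""
    counts = defaultdict(int)
    Ck_bar = []
    for tid, itemsets in Ck_bar_prev:
        Ct = set()
        for c in Ck:
            items = sorted(c)
            # c is contained in the transaction iff both (k-1)-subsets
            # (c - c[k]) and (c - c[k-1]) were found there in the
            # previous pass
            if (c - {items[-1]} in itemsets) and (c - {items[-2]} in itemsets):
                Ct.add(c)
                counts[c] += 1
        if Ct:  # step 10: keep an entry only if some candidate matched
            Ck_bar.append((tid, Ct))
    return counts, Ck_bar
```

Running this on the C̄_1 and C_2 of Figure 3 reproduces the supports listed for C_2 and the entries of C̄_2, including the absence of entries whose C_t is empty.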
We now describe how Step 7 of Figure 2 is implemented using the above data structures. Recall that the t.set-of-itemsets field of an entry t in C̄_{k-1} gives the IDs of all (k−1)-candidates contained in transaction t.TID. For each such candidate c_{k-1}, the extensions field gives T_k, the set of IDs of all the candidate k-itemsets that are extensions of c_{k-1}. For each c_k in T_k, the generators field gives the IDs of the two itemsets that generated c_k. If these itemsets are present in the entry for t.set-of-itemsets, we can conclude that c_k is present in transaction t.TID, and add c_k to C_t.

3 Performance

To assess the relative performance of the algorithms for discovering large sets, we performed several experiments on an IBM RS/6000 530H workstation with a CPU clock rate of 33 MHz, 64 MB of main memory, and running AIX 3.2. The data resided in the AIX file system and was stored on a 2GB SCSI 3.5" drive, with measured sequential throughput of about 2 MB/second.

We first give an overview of the AIS [4] and SETM [13] algorithms against which we compare the performance of the Apriori and AprioriTid algorithms. We then describe the synthetic datasets used in the performance evaluation and show the performance results. Finally, we describe how the best performance features of Apriori and AprioriTid can be combined into an AprioriHybrid algorithm and demonstrate its scale-up properties.

3.1 The AIS Algorithm

Candidate itemsets are generated and counted on-the-fly as the database is scanned. After reading a transaction, it is determined which of the itemsets that were found to be large in the previous pass are contained in this transaction. New candidate itemsets are generated by extending these large itemsets with other items in the transaction. A large itemset l is extended with only those items that are large and occur later in the lexicographic ordering than any of the items in l.

3.2 The SETM Algorithm

It thus generates and counts every candidate itemset that the AIS algorithm generates. However, to use the standard SQL join operation for candidate generation, SETM separates candidate generation from counting. It saves a copy of the candidate itemset together with the TID of the generating transaction in a sequential structure. At the end of the pass, the support count of candidate itemsets is determined by sorting and aggregating this sequential structure.

SETM remembers the TIDs of the generating transactions with the candidate itemsets. To avoid needing a subset operation, it uses this information to determine the large itemsets contained in the transaction read. L̄_k ⊆ C̄_k and is obtained by deleting those candidates that do not have minimum support. Assuming that the database is sorted in TID order, SETM can easily find the large itemsets contained in a transaction in the next pass by sorting L̄_k on TID. In fact, it needs to visit every member of L̄_k only once in TID order, and the candidate generation can be performed using the relational merge-join operation [13].

The disadvantage of this approach is mainly due to the size of candidate sets C̄_k. For each candidate itemset, the candidate set now has as many entries as the number of transactions in which the candidate itemset is present. Moreover, when we are ready to count the support for candidate itemsets at the end of the pass, C̄_k is in the wrong order and needs to be sorted on itemsets. After counting and pruning out small candidate itemsets that do not have minimum support, the resulting set L̄_k needs another sort on TID before it can be used for generating candidates in the next pass.

3.3 Generation of Synthetic Data

We generated synthetic transactions to evaluate the performance of the algorithms over a large range of data characteristics. These transactions mimic the transactions in the retailing environment. Our model of the "real" world is that people tend to buy sets
occur later in the lexicographic ordering of items than   of items together. Each such set is potentially a
any of the items in l. The candidates generated           maximal large itemset. An example of such a set
from a transaction are added to the set of candidate      might be sheets, pillow case, comforter, and ru es.
itemsets maintained for the pass, or the counts of        However, some people may buy only some of the
the corresponding entries are increased if they were      items from such a set. For instance, some people
created by an earlier transaction. See 4] for further     might buy only sheets and pillow case, and some only
details of the AIS algorithm.                             sheets. A transaction may contain more than one
                                                          large itemset. For example, a customer might place an
3.2 The SETM Algorithm                                    order for a dress and jacket when ordering sheets and
The SETM algorithm 13] was motivated by the desire        pillow cases, where the dress and jacket together form
to use SQL to compute large itemsets. Like AIS,           another large itemset. Transaction sizes are typically
the SETM algorithm also generates candidates on-          clustered around a mean and a few transactions have
the- y based on transactions read from the database.      many items. Typical sizes of large itemsets are also
clustered around a mean, with a few large itemsets        probability of picking the associated itemset.
having a large number of items.                              To model the phenomenon that all the items in
   To create a dataset, our synthetic data generation     a large itemset are not always bought together, we
program takes the parameters shown in Table 2.            assign each itemset in T a corruption level c. When
                                                          adding an itemset to a transaction, we keep dropping
                Table 2: Parameters                       an item from the itemset as long as a uniformly
                                                          distributed random number between 0 and 1 is less
  j j Number of transactions
  D                                                       than c. Thus for an itemset of size l, we will add l
  j j Average size of the transactions
  T                                                       items to the transaction 1 ; c of the time, l ; 1 items
   j j Average size of the maximal potentially
   I
                                                          c(1 ; c) of the time, l ; 2 items c2(1 ; c) of the time,
       large itemsets                                     etc. The corruption level for an itemset is xed and
  j j Number of maximal potentially large itemsets
  L
                                                          is obtained from a normal distribution with mean 0.5
  N    Number of items                                    and variance 0.1.
   We rst determine the size of the next transaction.        We generated datasets by setting N = 1000 and jLj
The size is picked from a Poisson distribution with       = 2000. We chose 3 values for jT j: 5, 10, and 20. We
mean equal to jT j. Note that if each item is chosen      also chose 3 values for jI j: 2, 4, and 6. The number of
with the same probability p, and there are N items,       transactions was to set to 100,000 because, as we will
the expected number of items in a transaction is given    see in Section 3.4, SETM could not be run for larger
by a binomial distribution with parameters N and p,       values. However, for our scale-up experiments, we
and is approximated by a Poisson distribution with        generated datasets with up to 10 million transactions
mean Np.                                                  (838MB for T20). Table 3 summarizes the dataset
   We then assign items to the transaction. Each          parameter settings. For the same jT j and jDj values,
transaction is assigned a series of potentially large     the size of datasets in megabytes were roughly equal
itemsets. If the large itemset on hand does not t in      for the di erent values of jI j.
the transaction, the itemset is put in the transaction
anyway in half the cases, and the itemset is moved to                  Table 3: Parameter settings
the next transaction the rest of the cases.
   Large itemsets are chosen from a set T of such          Name         jT j jI j jDj Size in Megabytes
itemsets. The number of itemsets in T is set to            T5.I2.D100K    5 2 100K 2.4
jLj. There is an inverse relationship between jLj and      T10.I2.D100K 10 2 100K 4.4
                                                           T10.I4.D100K 10 4 100K
the average support for potentially large itemsets.        T20.I2.D100K 20 2 100K 8.4
An itemset in T is generated by rst picking the            T20.I4.D100K 20 4 100K
size of the itemset from a Poisson distribution with       T20.I6.D100K 20 6 100K
mean equal to jI j. Items in the rst itemset
are chosen randomly. To model the phenomenon              3.4 Relative Performance
that large itemsets often have common items, some
fraction of items in subsequent itemsets are chosen       Figure 4 shows the execution times for the six
from the previous itemset generated. We use an            synthetic datasets given in Table 3 for decreasing
exponentially distributed random variable with mean       values of minimum support. As the minimum support
equal to the correlation level to decide this fraction    decreases, the execution times of all the algorithms
for each itemset. The remaining items are picked at       increase because of increases in the total number of
random. In the datasets used in the experiments,          candidate and large itemsets.
the correlation level was set to 0.5. We ran some            For SETM, we have only plotted the execution
experiments with the correlation level set to 0.25 and    times for the dataset T5.I2.D100K in Figure 4. The
0.75 but did not nd much di erence in the nature of       execution times for SETM for the two datasets with
our performance results.                                  an average transaction size of 10 are given in Table 4.
   Each itemset in T has a weight associated with         We did not plot the execution times in Table 4
it, which corresponds to the probability that this        on the corresponding graphs because they are too
itemset will be picked. This weight is picked from        large compared to the execution times of the other
an exponential distribution with unit mean, and is        algorithms. For the three datasets with transaction
then normalized so that the sum of the weights for all    sizes of 20, SETM took too long to execute and
the itemsets in T is 1. The next itemset to be put        we aborted those runs as the trends were clear.
in the transaction is chosen from T by tossing an jLj-    Clearly, Apriori beats SETM by more than an order
sided weighted coin, where the weight for a side is the   of magnitude for large datasets.
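The weighted itemset selection and the corruption-level mechanism described in
Section 3.3 can be sketched in Python as follows. This is a minimal
illustration with hypothetical function names, not the authors' generator;
clamping out-of-range normal draws to [0, 1] is our assumption, since the paper
does not specify it.

```python
import random

def make_corruption_level(rng):
    """Corruption level c: normal with mean 0.5 and variance 0.1
    (standard deviation sqrt(0.1)), per Section 3.3."""
    c = rng.gauss(0.5, 0.1 ** 0.5)
    # Clamping is an assumption; the paper leaves out-of-range draws unspecified.
    return min(max(c, 0.0), 1.0)

def corrupt(itemset, c, rng):
    """Keep dropping an item while a uniform draw in [0, 1) stays below c."""
    items = list(itemset)
    while items and rng.random() < c:
        items.pop(rng.randrange(len(items)))
    return items

def pick_itemset(itemsets, weights, rng):
    """Toss an |L|-sided weighted coin: weights drawn from an exponential
    distribution with unit mean and normalized to sum to 1."""
    return rng.choices(itemsets, weights=weights, k=1)[0]
```

With this loop, an itemset of size l contributes all l items with probability
1-c, l-1 items with probability c(1-c), l-2 items with probability c^2(1-c),
and so on, matching the distribution stated above.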
[Figure 4 consists of six plots, one per dataset of Table 3 (T5.I2.D100K,
T10.I2.D100K, T10.I4.D100K, T20.I2.D100K, T20.I4.D100K, T20.I6.D100K), each
showing execution time in seconds against minimum support (2% down to 0.25%)
for SETM (T5.I2.D100K only), AIS, AprioriTid, and Apriori; the plots are not
reproduced here.]

Figure 4: Execution times
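The candidate generation shared by Apriori and AprioriTid, whose behavior the
results above reflect, is the apriori-gen join-and-prune procedure defined in
Section 2 (outside this excerpt). The following is a simplified sketch under
the assumption that itemsets are represented as sorted tuples; it is an
illustration, not the authors' implementation.

```python
from itertools import combinations

def apriori_gen(prev_large, k):
    """Candidate k-itemsets from the large (k-1)-itemsets.

    Join: merge two (k-1)-itemsets that agree on their first k-2 items.
    Prune: drop candidates with any (k-1)-subset outside prev_large
    (the apriori property).
    """
    candidates = set()
    for a in prev_large:
        for b in prev_large:
            # join step: equal prefix, last items in increasing order
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                candidates.add(a + (b[k - 2],))
    # prune step: every (k-1)-subset must itself be large
    return {c for c in candidates
            if all(s in prev_large for s in combinations(c, k - 1))}
```

For pass 2 the join degenerates to the cross-product of L1 with L1, as noted in
Section 3.5 below.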
Table 4: Execution times in seconds for SETM

  Algorithm              Minimum Support
              2.0%   1.5%   1.0%   0.75%   0.5%
  Dataset T10.I2.D100K
  SETM          74    161    838    1262   1878
  Apriori      4.4    5.3   11.0    14.5   15.3
  Dataset T10.I4.D100K
  SETM          41     91    659     929   1639
  Apriori      3.8    4.8   11.2    17.4   19.3

Apriori beats AIS for all problem sizes, by factors ranging from 2 for high
minimum support to more than an order of magnitude for low levels of support.
AIS always did considerably better than SETM. For small problems, AprioriTid
did about as well as Apriori, but performance degraded to about twice as slow
for large problems.

3.5 Explanation of the Relative Performance

To explain these performance trends, we show in Figure 5 the sizes of the
large and candidate sets in different passes for the T10.I4.D100K dataset for
the minimum support of 0.75%. Note that the Y-axis in this graph has a log
scale.

[Figure 5 plots, on a log scale, the number of itemsets (1 to 10^7) against
the pass number (1 to 7) for C̄k (SETM), C̄k (AprioriTid), Ck (AIS, SETM),
Ck (Apriori, AprioriTid), and Lk; the plot is not reproduced here.]

Figure 5: Sizes of the large and candidate sets (T10.I4.D100K, minsup = 0.75%)

The fundamental problem with the SETM algorithm is the size of its C̄k sets.
Recall that the size of the set C̄k is given by

    Σ_{candidate itemsets c} support-count(c).

Thus, the sets C̄k are roughly S times bigger than the corresponding Ck sets,
where S is the average support count of the candidate itemsets. Unless the
problem size is very small, the C̄k sets have to be written to disk, and
externally sorted twice, causing the SETM algorithm to perform poorly.² This
explains the jump in time for SETM in Table 4 when going from 1.5% support to
1.0% support for datasets with transaction size 10. The largest dataset in the
scale-up experiments for SETM in [13] was still small enough that C̄k could fit
in memory; hence they did not encounter this jump in execution time. Note that
for the same minimum support, the support count for candidate itemsets
increases linearly with the number of transactions. Thus, as we increase the
number of transactions for the same values of |T| and |I|, though the size of
Ck does not change, the size of C̄k goes up linearly. Thus, for datasets with
more transactions, the performance gap between SETM and the other algorithms
will become even larger.

² The cost of external sorting in SETM can be reduced somewhat as follows.
Before writing out entries in C̄k to disk, we can sort them on itemsets using
an internal sorting procedure, and write them as sorted runs. These sorted runs
can then be merged to obtain support counts. However, given the poor
performance of SETM, we do not expect this optimization to affect the algorithm
choice.

The problem with AIS is that it generates too many candidates that later turn
out to be small, causing it to waste too much effort. Apriori also counts too
many small sets in the second pass (recall that C2 is really a cross-product of
L1 with L1). However, this wastage decreases dramatically from the third pass
onward. Note that for the example in Figure 5, after pass 3, almost every
candidate itemset counted by Apriori turns out to be a large set.

AprioriTid also has the problem of SETM that C̄k tends to be large. However,
the apriori candidate generation used by AprioriTid generates significantly
fewer candidates than the transaction-based candidate generation used by SETM.
As a result, the C̄k of AprioriTid has fewer entries than that of SETM.
AprioriTid is also able to use a single word (ID) to store a candidate rather
than requiring as many words as the number of items in the candidate.³ In
addition, unlike SETM, AprioriTid does not have to sort C̄k. Thus, AprioriTid
does not suffer as much as SETM from maintaining C̄k.

³ For SETM to use IDs, it would have to maintain two additional in-memory data
structures: a hash table to find out whether a candidate has been generated
previously, and a mapping from the IDs to candidates. However, this would
destroy the set-oriented nature of the algorithm. Also, once we have the hash
table which gives us the IDs of candidates, we might as well count them at the
same time and avoid the two external sorts. We experimented with this variant
of SETM and found that, while it did better than SETM, it still performed much
worse than Apriori or AprioriTid.

AprioriTid has the nice feature that it replaces a pass over the original
dataset by a pass over the set C̄k. Hence, AprioriTid is very effective in
later passes when the size of C̄k becomes small compared to the size of the
database. Thus, we find that AprioriTid beats Apriori when its C̄k sets can fit
in memory and the distribution of the large itemsets has a long tail. When C̄k
doesn't fit in memory, there is a jump in the execution time for AprioriTid,
such as when going from 0.75% to 0.5% for datasets with transaction size 10 in
Figure 4. In this region, Apriori starts beating AprioriTid.

3.6 Algorithm AprioriHybrid

It is not necessary to use the same algorithm in all the passes over data.
Figure 6 shows the execution times for Apriori and AprioriTid for different
passes over the dataset T10.I4.D100K. In the earlier passes, Apriori does
better than AprioriTid. However, AprioriTid beats Apriori in later passes. We
observed similar relative behavior for the other datasets, the reason for which
is as follows. Apriori and AprioriTid use the same candidate generation
procedure and therefore count the same itemsets. In the later passes, the
number of candidate itemsets reduces (see the size of Ck for Apriori and
AprioriTid in Figure 5). However, Apriori still examines every transaction in
the database. On the other hand, rather than scanning the database, AprioriTid
scans C̄k for obtaining support counts, and the size of C̄k has become smaller
than the size of the database. When the C̄k sets can fit in memory, we do not
even incur the cost of writing them to disk.

[Figure 6 plots per-pass execution time in seconds against the pass number
(1 to 7) for Apriori and AprioriTid; the plot is not reproduced here.]

Figure 6: Per pass execution times of Apriori and AprioriTid (T10.I4.D100K,
minsup = 0.75%)

Based on these observations, we can design a hybrid algorithm, which we call
AprioriHybrid, that uses Apriori in the initial passes and switches to
AprioriTid when it expects that the set C̄k at the end of the pass will fit in
memory. We use the following heuristic to estimate if C̄k would fit in memory
in the next pass. At the end of the current pass, we have the counts of the
candidates in Ck. From this, we estimate what the size of C̄k would have been
if it had been generated. This size, in words, is

    (Σ_{candidates c ∈ Ck} support(c) + number of transactions).

If C̄k in this pass was small enough to fit in memory, and there were fewer
large candidates in the current pass than the previous pass, we switch to
AprioriTid. The latter condition is added to avoid switching when C̄k in the
current pass fits in memory but C̄k in the next pass may not.

Switching from Apriori to AprioriTid does involve a cost. Assume that we decide
to switch from Apriori to AprioriTid at the end of the kth pass. In the (k+1)th
pass, after finding the candidate itemsets contained in a transaction, we will
also have to add their IDs to C̄k+1 (see the description of AprioriTid in
Section 2.2). Thus there is an extra cost incurred in this pass relative to
just running Apriori. It is only in the (k+2)th pass that we actually start
running AprioriTid. Thus, if there are no large (k+1)-itemsets, or no
(k+2)-candidates, we will incur the cost of switching without getting any of
the savings of using AprioriTid.

Figure 7 shows the performance of AprioriHybrid relative to Apriori and
AprioriTid for three datasets. AprioriHybrid performs better than Apriori in
almost all cases. For T10.I2.D100K with 1.5% support, AprioriHybrid does a
little worse than Apriori since the pass in which the switch occurred was the
last pass; AprioriHybrid thus incurred the cost of switching without realizing
the benefits. In general, the advantage of AprioriHybrid over Apriori depends
on how the size of the C̄k set declines in the later passes. If C̄k remains
large until nearly the end and then has an abrupt drop, we will not gain much
by using AprioriHybrid since we can use AprioriTid only for a short period of
time after the switch. This is what happened with the T20.I6.D100K dataset. On
the other hand, if there is a gradual decline in the size of C̄k, AprioriTid
can be used for a while after the switch, and a significant improvement can be
obtained in the execution time.

[Figure 7 shows three plots (T10.I2.D100K, T10.I4.D100K, T20.I6.D100K) of
execution time in seconds against minimum support (2% down to 0.25%) for
AprioriTid, Apriori, and AprioriHybrid; the plots are not reproduced here.]

3.7 Scale-up Experiment

Figure 8 shows how AprioriHybrid scales up as the number of transactions is
increased from 100,000 to 10 million transactions. We used the combinations
(T5.I2), (T10.I4), and (T20.I6) for the average sizes of transactions and
itemsets respectively. All other parameters were the same as for the data in
Table 3. The sizes of these datasets for 10 million transactions were 239MB,
439MB and 838MB respectively. The minimum support level was set to 0.75%. The
execution times are normalized with respect to the times for the 100,000
transaction datasets in the first graph and with respect to the 1 million
transaction dataset in the second. As shown, the execution times scale quite
linearly.

[Figure 8 shows two plots of relative execution time for T5.I2, T10.I4, and
T20.I6: one against the number of transactions from 100,000 to 1 million, the
other against the number of transactions from 1 million to 10 million; the
plots are not reproduced here.]

Figure 8: Number of transactions scale-up

Next, we examined how AprioriHybrid scaled up with the number of items. We
increased the number of items from 1000 to 10,000 for the three parameter
settings T5.I2.D100K, T10.I4.D100K and T20.I6.D100K. All other parameters were
the same as for the data in Table 3. We ran experiments for a
                                                                           minimum support at 0.75%, and obtained the results
                300

                200                                                        shown in Figure 9. The execution times decreased a
                                                                           little since the average support for an item decreased
                100                                                        as we increased the number of items. This resulted
                                                                           in fewer large itemsets and, hence, faster execution
                  0                                                        times.
                                                                              Finally, we investigated the scale-up as we increased
                      2   1.5   1       0.75      0.5        0.33   0.25
                                    Minimum Support
                                                                           the average transaction size. The aim of this
               Figure 7: Execution times: AprioriHybrid                    experiment was to see how our data structures scaled
                                                                           with the transaction size, independent of other factors
                                                                           like the physical database size and the number of
                                                                           large itemsets. We kept the physical size of the
[Figure 9: Number of items scale-up — execution time (sec) for T20.I6, T10.I4 and T5.I2 as the number of items grows from 1000 to 10,000]

[Figure 10: Transaction size scale-up — execution time (sec) at minimum supports of 500, 750 and 1000 transactions, as the average transaction size grows from 5 to 50]

database roughly constant by keeping the product of the average transaction size and the number of transactions constant. The number of transactions ranged from 200,000 for the database with an average transaction size of 5 to 20,000 for the database with an average transaction size of 50. Fixing the minimum support as a percentage would have led to large increases in the number of large itemsets as the transaction size increased, since the probability of an itemset being present in a transaction is roughly proportional to the transaction size. We therefore fixed the minimum support level in terms of the number of transactions. The results are shown in Figure 10. The numbers in the key (e.g. 500) refer to this minimum support. As shown, the execution times increase with the transaction size, but only gradually. The main reason for the increase was that, in spite of setting the minimum support in terms of the number of transactions, the number of large itemsets increased with increasing transaction length. A secondary reason was that finding the candidates present in a transaction took a little more time.

4 Conclusions and Future Work

We presented two new algorithms, Apriori and AprioriTid, for discovering all significant association rules between items in a large database of transactions. We compared these algorithms to the previously known algorithms, the AIS [4] and SETM [13] algorithms. We presented experimental results showing that the proposed algorithms always outperform AIS and SETM. The performance gap increased with the problem size, and ranged from a factor of three for small problems to more than an order of magnitude for large problems.

We showed how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid, which then becomes the algorithm of choice for this problem. Scale-up experiments showed that AprioriHybrid scales linearly with the number of transactions. In addition, the execution time decreases a little as the number of items in the database increases. As the average transaction size increases (while keeping the database size constant), the execution time increases only gradually. These experiments demonstrate the feasibility of using AprioriHybrid in real applications involving very large databases.

The algorithms presented in this paper have been implemented on several data repositories, including the AIX file system, DB2/MVS, and DB2/6000. We have also tested these algorithms against real customer data, the details of which can be found in [5]. In the future, we plan to extend this work along the following dimensions:

- Multiple taxonomies (is-a hierarchies) over items are often available. An example of such a hierarchy is that a dish washer is a kitchen appliance is a heavy electric appliance, etc. We would like to be able to find association rules that use such hierarchies.
- We did not consider the quantities of the items bought in a transaction, which are useful for some applications. Finding such rules needs further work.

The work reported in this paper has been done in the context of the Quest project at the IBM Almaden Research Center. In Quest, we are exploring the various aspects of the database mining problem. Besides the problem of discovering association rules, some other problems that we have looked into include
the enhancement of the database capability with classification queries [2] and similarity queries over time sequences [1]. We believe that database mining is an important new application area for databases, combining commercial interest with intriguing research questions.

Acknowledgment: We wish to thank Mike Carey for his insightful comments and suggestions.

References

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proc. of the Fourth International Conference on Foundations of Data Organization and Algorithms, Chicago, October 1993.
[2] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An interval classifier for database mining applications. In Proc. of the VLDB Conference, pages 560-573, Vancouver, British Columbia, Canada, 1992.
[3] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914-925, December 1993. Special Issue on Learning and Discovery in Knowledge-Based Databases.
[4] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, Washington, D.C., May 1993.
[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.
[6] D. S. Associates. The New Direct Marketing. Business One Irwin, Illinois, 1990.
[7] R. Brachman et al. Integrated support for data archeology. In AAAI-93 Workshop on Knowledge Discovery in Databases, July 1993.
[8] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
[9] P. Cheeseman et al. AutoClass: A Bayesian classification system. In 5th Int'l Conf. on Machine Learning. Morgan Kaufmann, June 1988.
[10] D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2), 1987.
[11] J. Han, Y. Cai, and N. Cercone. Knowledge discovery in databases: An attribute oriented approach. In Proc. of the VLDB Conference, pages 547-559, Vancouver, British Columbia, Canada, 1992.
[12] M. Holsheimer and A. Siebes. Data mining: The search for knowledge in databases. Technical Report CS-R9406, CWI, Netherlands, 1994.
[13] M. Houtsma and A. Swami. Set-oriented mining of association rules. Research Report RJ 9567, IBM Almaden Research Center, San Jose, California, October 1993.
[14] R. Krishnamurthy and T. Imielinski. Practitioner problems in need of database research: Research directions in knowledge discovery. SIGMOD RECORD, 20(3):76-78, September 1991.
[15] P. Langley, H. Simon, G. Bradshaw, and J. Zytkow. Scientific Discovery: Computational Explorations of the Creative Process. MIT Press, 1987.
[16] H. Mannila and K.-J. Raiha. Dependency inference. In Proc. of the VLDB Conference, pages 155-158, Brighton, England, 1987.
[17] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In KDD-94: AAAI Workshop on Knowledge Discovery in Databases, July 1994.
[18] S. Muggleton and C. Feng. Efficient induction of logic programs. In S. Muggleton, editor, Inductive Logic Programming. Academic Press, 1992.
[19] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1992.
[20] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro, editor, Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
[21] G. Piatetsky-Shapiro, editor. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
[22] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
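For readers who want a concrete reference point for the level-wise candidate generation summarized in the conclusions, the following is a minimal illustrative Python sketch of Apriori-style frequent-itemset mining. It is not the authors' implementation (the paper's code was evaluated in C on AIX and DB2); the function name, data layout, and the use of an absolute support count are our own assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Illustrative level-wise frequent-itemset mining (Apriori-style sketch).

    transactions: list of sets of items.
    min_support: absolute support count threshold (not a percentage).
    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    # Pass 1: count single items to get the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate that has an infrequent (k-1)-subset (the Apriori property:
        # every subset of a frequent itemset must itself be frequent).
        prev = list(frequent)
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) == k and all(frozenset(s) in frequent
                                       for s in combinations(union, k - 1)):
                candidates.add(union)
        # One pass over the data to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:       # candidate contained in the transaction
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

Note that the subset-based pruning before the counting pass is what distinguishes Apriori from the earlier AIS and SETM algorithms, which generate candidates on the fly while scanning the data.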

Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
Indexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfIndexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdf
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research Discourse
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
 
prashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Professionprashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Profession
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 

Agrawal association rules

Fast Algorithms for Mining Association Rules

Rakesh Agrawal    Ramakrishnan Srikant*
IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120

* Visiting from the Department of Computer Science, University of Wisconsin, Madison.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994

Abstract

We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database.

1 Introduction

Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data. A record in such data typically consists of the transaction date and the items bought in the transaction. Successful organizations view such databases as important pieces of the marketing infrastructure. They are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies [6].

The problem of mining association rules over basket data was introduced in [4]. An example of such a rule might be that 98% of customers that purchase tires and auto accessories also get automotive services done. Finding all such rules is valuable for cross-marketing and attached mailing applications. Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns. The databases involved in these applications are very large. It is imperative, therefore, to have fast algorithms for this task.

The following is a formal statement of the problem [4]: Let I = {i1, i2, ..., im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its TID. We say that a transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y. Our rules are somewhat more general than in [4] in that we allow a consequent to have more than one item.

Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf) respectively. Our discussion is neutral with respect to the representation of D. For example, D could be a data file, a relational table, or the result of a relational expression.

An algorithm for finding all association rules, henceforth referred to as the AIS algorithm, was presented in [4]. Another algorithm for this task, called the SETM algorithm, has been proposed in [13]. In this paper, we present two new algorithms, Apriori and AprioriTid, that differ fundamentally from these algorithms. We present experimental results showing
that the proposed algorithms always outperform the earlier algorithms. The performance gap is shown to increase with problem size, and ranges from a factor of three for small problems to more than an order of magnitude for large problems. We then discuss how the best features of Apriori and AprioriTid can be combined into a hybrid algorithm, called AprioriHybrid. Experiments show that AprioriHybrid has excellent scale-up properties, opening up the feasibility of mining association rules over very large databases.

The problem of finding association rules falls within the purview of database mining [3][12], also called knowledge discovery in databases [21]. Related, but not directly applicable, work includes the induction of classification rules [8][11][22], discovery of causal rules [19], learning of logical definitions [18], fitting of functions to data [15], and clustering [9][10]. The closest work in the machine learning literature is the KID3 algorithm presented in [20]. If used for finding all association rules, this algorithm will make as many passes over the data as the number of combinations of items in the antecedent, which is exponentially large. Related work in the database literature is the work on inferring functional dependencies from data [16]. Functional dependencies are rules requiring strict satisfaction. Consequently, having determined a dependency X → A, the algorithms in [16] consider any other dependency of the form X + Y → A redundant and do not generate it. The association rules we consider are probabilistic in nature. The presence of a rule X → A does not necessarily mean that X + Y → A also holds because the latter may not have minimum support. Similarly, the presence of rules X → Y and Y → Z does not necessarily mean that X → Z holds because the latter may not have minimum confidence.

There has been work on quantifying the "usefulness" or "interestingness" of a rule [20]. What is useful or interesting is often application-dependent. The need for a human in the loop and providing tools to allow human guidance of the rule discovery process has been articulated, for example, in [7][14]. We do not discuss these issues in this paper, except to point out that these are necessary features of a rule discovery system that may use our algorithms as the engine of the discovery process.

1.1 Problem Decomposition and Paper Organization

The problem of discovering all association rules can be decomposed into two subproblems [4]:

1. Find all sets of items (itemsets) that have transaction support above minimum support. The support for an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large itemsets, and all others small itemsets. In Section 2, we give new algorithms, Apriori and AprioriTid, for solving this problem.

2. Use the large itemsets to generate the desired rules. Here is a straightforward algorithm for this task. For every large itemset l, find all non-empty subsets of l. For every such subset a, output a rule of the form a ⇒ (l - a) if the ratio of support(l) to support(a) is at least minconf. We need to consider all subsets of l to generate rules with multiple consequents. Due to lack of space, we do not discuss this subproblem further, but refer the reader to [5] for a fast algorithm.

In Section 3, we show the relative performance of the proposed Apriori and AprioriTid algorithms against the AIS [4] and SETM [13] algorithms. To make the paper self-contained, we include an overview of the AIS and SETM algorithms in this section. We also describe how the Apriori and AprioriTid algorithms can be combined into a hybrid algorithm, AprioriHybrid, and demonstrate the scale-up properties of this algorithm. We conclude by pointing out some related open problems in Section 4.

2 Discovering Large Itemsets

Algorithms for discovering large itemsets make multiple passes over the data. In the first pass, we count the support of individual items and determine which of them are large, i.e. have minimum support. In each subsequent pass, we start with a seed set of itemsets found to be large in the previous pass. We use this seed set for generating new potentially large itemsets, called candidate itemsets, and count the actual support for these candidate itemsets during the pass over the data. At the end of the pass, we determine which of the candidate itemsets are actually large, and they become the seed for the next pass. This process continues until no new large itemsets are found.

The Apriori and AprioriTid algorithms we propose differ fundamentally from the AIS [4] and SETM [13] algorithms in terms of which candidate itemsets are counted in a pass and in the way that those candidates are generated. In both the AIS and SETM algorithms, candidate itemsets are generated on-the-fly during the pass as data is being read. Specifically, after reading a transaction, it is determined which of the itemsets found large in the previous pass are present in the transaction. New candidate itemsets are generated by extending these large itemsets with other items in the transaction. However, as we will see, the disadvantage
is that this results in unnecessarily generating and counting too many candidate itemsets that turn out to be small.

The Apriori and AprioriTid algorithms generate the candidate itemsets to be counted in a pass by using only the itemsets found large in the previous pass, without considering the transactions in the database. The basic intuition is that any subset of a large itemset must be large. Therefore, the candidate itemsets having k items can be generated by joining large itemsets having k-1 items, and deleting those that contain any subset that is not large. This procedure results in generation of a much smaller number of candidate itemsets.

The AprioriTid algorithm has the additional property that the database is not used at all for counting the support of candidate itemsets after the first pass. Rather, an encoding of the candidate itemsets used in the previous pass is employed for this purpose. In later passes, the size of this encoding can become much smaller than the database, thus saving much reading effort. We will explain these points in more detail when we describe the algorithms.

Notation: We assume that items in each transaction are kept sorted in their lexicographic order. It is straightforward to adapt these algorithms to the case where the database D is kept normalized and each database record is a <TID, item> pair, where TID is the identifier of the corresponding transaction. We call the number of items in an itemset its size, and call an itemset of size k a k-itemset. Items within an itemset are kept in lexicographic order. We use the notation c[1]·c[2]·...·c[k] to represent a k-itemset c consisting of items c[1], c[2], ..., c[k], where c[1] < c[2] < ... < c[k]. If c = X·Y and Y is an m-itemset, we also call Y an m-extension of X. Associated with each itemset is a count field to store the support for this itemset. The count field is initialized to zero when the itemset is first created. We summarize in Table 1 the notation used in the algorithms. The set C̄k is used by AprioriTid and will be further discussed when we describe this algorithm.

Table 1: Notation

k-itemset   An itemset having k items.
Lk          Set of large k-itemsets (those with minimum support).
            Each member of this set has two fields: i) itemset and ii) support count.
Ck          Set of candidate k-itemsets (potentially large itemsets).
            Each member of this set has two fields: i) itemset and ii) support count.
C̄k          Set of candidate k-itemsets when the TIDs of the generating
            transactions are kept associated with the candidates.

2.1 Algorithm Apriori

Figure 1 gives the Apriori algorithm. The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the large itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the apriori-gen function described in Section 2.1.1. Next, the database is scanned and the support of candidates in Ck is counted. For fast counting, we need to efficiently determine the candidates in Ck that are contained in a given transaction t. Section 2.1.2 describes the subset function used for this purpose. See [5] for a discussion of buffer management.

Figure 1: Algorithm Apriori

1)  L1 = {large 1-itemsets};
2)  for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
3)    Ck = apriori-gen(Lk-1);  // New candidates
4)    forall transactions t ∈ D do begin
5)      Ct = subset(Ck, t);    // Candidates contained in t
6)      forall candidates c ∈ Ct do
7)        c.count++;
8)    end
9)    Lk = {c ∈ Ck | c.count ≥ minsup};
10) end
11) Answer = ∪k Lk;

2.1.1 Apriori Candidate Generation

The apriori-gen function takes as argument Lk-1, the set of all large (k-1)-itemsets. It returns a superset of the set of all large k-itemsets. The function works as follows.¹ First, in the join step, we join Lk-1 with Lk-1:

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;

Next, in the prune step, we delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1.

¹ Concurrent to our work, the following two-step candidate generation procedure has been proposed in [17]: C'k = {X ∪ X' | X, X' ∈ Lk-1, |X ∩ X'| = k-2}; Ck = {X ∈ C'k | X contains k members of Lk-1}. These two steps are similar to our join and prune steps respectively. However, in general, step 1 would produce a superset of the candidates produced by our join step.
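The join and prune steps of apriori-gen can be sketched in Python (an illustrative sketch, not the paper's implementation; itemsets are represented as sorted tuples, and `apriori_gen` is a hypothetical helper name):

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Generate candidate k-itemsets from the large (k-1)-itemsets.

    L_prev: set of sorted tuples, each of length k-1.
    Returns the set of candidate k-itemsets as sorted tuples.
    """
    k_minus_1 = len(next(iter(L_prev)))
    # Join step: combine two (k-1)-itemsets that agree on their first k-2 items;
    # the p[-1] < q[-1] condition prevents duplicates, as in the SQL join above.
    candidates = {p + (q[-1],) for p in L_prev for q in L_prev
                  if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Prune step: drop any candidate with a (k-1)-subset that is not large.
    return {c for c in candidates
            if all(s in L_prev for s in combinations(c, k_minus_1))}

# The paper's example: L3 = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
# Join gives {1 2 3 4} and {1 3 4 5}; prune removes {1 3 4 5}
# because {1 4 5} is not in L3.
C4 = apriori_gen(L3)
```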
In pseudocode, the prune step is:

forall itemsets c ∈ Ck do
    forall (k-1)-subsets s of c do
        if (s ∉ Lk-1) then
            delete c from Ck;

Example: Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}. After the join step, C4 will be {{1 2 3 4}, {1 3 4 5}}. The prune step will delete the itemset {1 3 4 5} because the itemset {1 4 5} is not in L3. We will then be left with only {1 2 3 4} in C4.

Contrast this candidate generation with the one used in the AIS and SETM algorithms. In pass k of these algorithms, a database transaction t is read and it is determined which of the large itemsets in Lk-1 are present in t. Each of these large itemsets l is then extended with all those large items that are present in t and occur later in the lexicographic ordering than any of the items in l. Continuing with the previous example, consider a transaction {1 2 3 4 5}. In the fourth pass, AIS and SETM will generate two candidates, {1 2 3 4} and {1 2 3 5}, by extending the large itemset {1 2 3}. Similarly, an additional three candidate itemsets will be generated by extending the other large itemsets in L3, leading to a total of 5 candidates for consideration in the fourth pass. Apriori, on the other hand, generates and counts only one itemset, {1 2 3 4}, because it concludes a priori that the other combinations cannot possibly have minimum support.

Correctness: We need to show that Ck ⊇ Lk. Clearly, any subset of a large itemset must also have minimum support. Hence, if we extended each itemset in Lk-1 with all possible items and then deleted all those whose (k-1)-subsets were not in Lk-1, we would be left with a superset of the itemsets in Lk. The join is equivalent to extending Lk-1 with each item in the database and then deleting those itemsets for which the (k-1)-itemset obtained by deleting the (k-1)th item is not in Lk-1. The condition p.itemk-1 < q.itemk-1 simply ensures that no duplicates are generated. Thus, after the join step, Ck ⊇ Lk. By similar reasoning, the prune step, where we delete from Ck all itemsets whose (k-1)-subsets are not in Lk-1, also does not delete any itemset that could be in Lk.

Variation: Counting Candidates of Multiple Sizes in One Pass: Rather than counting only candidates of size k in the kth pass, we can also count the candidates C'k+1, where C'k+1 is generated from Ck, etc. Note that C'k+1 ⊇ Ck+1 since Ck+1 is generated from Lk. This variation can pay off in the later passes when the cost of counting and keeping in memory the additional C'k+1 - Ck+1 candidates becomes less than the cost of scanning the database.

2.1.2 Subset Function

Candidate itemsets Ck are stored in a hash-tree. A node of the hash-tree either contains a list of itemsets (a leaf node) or a hash table (an interior node). In an interior node, each bucket of the hash table points to another node. The root of the hash-tree is defined to be at depth 1. An interior node at depth d points to nodes at depth d+1. Itemsets are stored in the leaves. When we add an itemset c, we start from the root and go down the tree until we reach a leaf. At an interior node at depth d, we decide which branch to follow by applying a hash function to the dth item of the itemset. All nodes are initially created as leaf nodes. When the number of itemsets in a leaf node exceeds a specified threshold, the leaf node is converted to an interior node.

Starting from the root node, the subset function finds all the candidates contained in a transaction t as follows. If we are at a leaf, we find which of the itemsets in the leaf are contained in t and add references to them to the answer set. If we are at an interior node and we have reached it by hashing the item i, we hash on each item that comes after i in t and recursively apply this procedure to the node in the corresponding bucket. For the root node, we hash on every item in t.

To see why the subset function returns the desired set of references, consider what happens at the root node. For any itemset c contained in transaction t, the first item of c must be in t. At the root, by hashing on every item in t, we ensure that we only ignore itemsets that start with an item not in t. Similar arguments apply at lower depths. The only additional factor is that, since the items in any itemset are ordered, if we reach the current node by hashing the item i, we only need to consider the items in t that occur after i.

2.2 Algorithm AprioriTid

The AprioriTid algorithm, shown in Figure 2, also uses the apriori-gen function (given in Section 2.1.1) to determine the candidate itemsets before the pass begins. The interesting feature of this algorithm is that the database D is not used for counting support after the first pass. Rather, the set C̄k is used for this purpose. Each member of the set C̄k is of the form <TID, {Xk}>, where each Xk is a potentially large k-itemset present in the transaction with identifier TID. For k = 1, C̄1 corresponds to the database D, although conceptually each item i is replaced by the itemset {i}. For k > 1, C̄k is generated by the algorithm (step 10). The member
of C̄k corresponding to transaction t is <t.TID, {c ∈ Ck | c contained in t}>. If a transaction does not contain any candidate k-itemset, then C̄k will not have an entry for this transaction. Thus, the number of entries in C̄k may be smaller than the number of transactions in the database, especially for large values of k. In addition, for large values of k, each entry may be smaller than the corresponding transaction because very few candidates may be contained in the transaction. However, for small values of k, each entry may be larger than the corresponding transaction because an entry in C̄k includes all candidate k-itemsets contained in the transaction.

In Section 2.2.1, we give the data structures used to implement the algorithm. See [5] for a proof of correctness and a discussion of buffer management.

Figure 2: Algorithm AprioriTid

1)  L1 = {large 1-itemsets};
2)  C̄1 = database D;
3)  for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
4)    Ck = apriori-gen(Lk-1);  // New candidates
5)    C̄k = ∅;
6)    forall entries t ∈ C̄k-1 do begin
7)      // determine candidate itemsets in Ck contained
        // in the transaction with identifier t.TID
        Ct = {c ∈ Ck | (c - c[k]) ∈ t.set-of-itemsets ∧ (c - c[k-1]) ∈ t.set-of-itemsets};
8)      forall candidates c ∈ Ct do
9)        c.count++;
10)     if (Ct ≠ ∅) then C̄k += <t.TID, Ct>;
11)   end
12)   Lk = {c ∈ Ck | c.count ≥ minsup};
13) end
14) Answer = ∪k Lk;

Example: Consider the database in Figure 3 and assume that minimum support is 2 transactions. Calling apriori-gen with L1 at step 4 gives the candidate itemsets C2. In steps 6 through 10, we count the support of candidates in C2 by iterating over the entries in C̄1 and generate C̄2. The first entry in C̄1 is { {1}, {3}, {4} }, corresponding to transaction 100. The Ct at step 7 corresponding to this entry t is { {1 3} }, because {1 3} is a member of C2 and both ({1 3} - {1}) and ({1 3} - {3}) are members of t.set-of-itemsets.

Calling apriori-gen with L2 gives C3. Making a pass over the data with C̄2 and C3 generates C̄3. Note that there is no entry in C̄3 for the transactions with TIDs 100 and 400, since they do not contain any of the itemsets in C3. The candidate {2 3 5} in C3 turns out to be large and is the only member of L3. When we generate C4 using L3, it turns out to be empty, and we terminate.

Figure 3: Example

Database
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

C̄1
TID   Set-of-Itemsets
100   { {1}, {3}, {4} }
200   { {2}, {3}, {5} }
300   { {1}, {2}, {3}, {5} }
400   { {2}, {5} }

L1
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2
Itemset   Support
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

C̄2
TID   Set-of-Itemsets
100   { {1 3} }
200   { {2 3}, {2 5}, {3 5} }
300   { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} }
400   { {2 5} }

L2
Itemset   Support
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3
Itemset   Support
{2 3 5}   2

C̄3
TID   Set-of-Itemsets
200   { {2 3 5} }
300   { {2 3 5} }

L3
Itemset   Support
{2 3 5}   2

2.2.1 Data Structures

We assign each candidate itemset a unique number, called its ID. Each set of candidate itemsets Ck is kept in an array indexed by the IDs of the itemsets in Ck. A member of C̄k is now of the form <TID, {ID}>. Each C̄k is stored in a sequential structure.

The apriori-gen function generates a candidate k-itemset ck by joining two large (k-1)-itemsets. We maintain two additional fields for each candidate itemset: i) generators and ii) extensions. The generators field of a candidate itemset ck stores the IDs of the two large (k-1)-itemsets whose join generated ck. The extensions field of an itemset ck stores the IDs of all the (k+1)-candidates that are extensions of ck. Thus, when a candidate ck is generated by joining l¹k-1 and l²k-1, we save the IDs of l¹k-1 and l²k-1 in the generators field for ck. At the same time, the ID of ck is added to the extensions field of l¹k-1.
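Under the assumption that itemsets are sorted tuples and C̄k-1 is held as a list of <TID, set-of-itemsets> pairs, one counting pass of Figure 2 can be sketched in Python as follows (`apriori_tid_pass` is a hypothetical helper name; for simplicity the sketch applies the step 7 test directly to itemsets rather than to the ID-based structures described above):

```python
def apriori_tid_pass(Ck, Cbar_prev):
    """One counting pass of AprioriTid (steps 5-11 of Figure 2).

    Ck: candidate k-itemsets as sorted tuples.
    Cbar_prev: list of (TID, set of (k-1)-itemsets) entries, i.e. C-bar_{k-1}.
    Returns (support counts for Ck, the new C-bar_k encoding).
    """
    counts = {c: 0 for c in Ck}
    Cbar_k = []
    for tid, itemsets in Cbar_prev:
        # Step 7: c is contained in the transaction iff both (k-1)-subsets
        # obtained by dropping c[k] and c[k-1] were found in it last pass.
        Ct = {c for c in Ck
              if c[:-1] in itemsets and c[:-2] + c[-1:] in itemsets}
        for c in Ct:
            counts[c] += 1
        if Ct:
            Cbar_k.append((tid, Ct))
    return counts, Cbar_k

# C-bar_1 and C2 from the example of Figure 3.
Cbar1 = [(100, {(1,), (3,), (4,)}),
         (200, {(2,), (3,), (5,)}),
         (300, {(1,), (2,), (3,), (5,)}),
         (400, {(2,), (5,)})]
C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
counts, Cbar2 = apriori_tid_pass(C2, Cbar1)
```

Note that the database itself is not consulted: the pass reads only the C̄1 encoding, which is the property that makes the later passes of AprioriTid cheap.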
We now describe how Step 7 of Figure 2 is implemented using the above data structures. Recall that the t.set-of-itemsets field of an entry t in C̄k-1 gives the IDs of all (k-1)-candidates contained in transaction t.TID. For each such candidate ck-1, the extensions field gives Tk, the set of IDs of all the candidate k-itemsets that are extensions of ck-1. For each ck in Tk, the generators field gives the IDs of the two itemsets that generated ck. If these itemsets are present in the entry for t.set-of-itemsets, we can conclude that ck is present in transaction t.TID, and add ck to Ct.

3 Performance

To assess the relative performance of the algorithms for discovering large itemsets, we performed several experiments on an IBM RS/6000 530H workstation with a CPU clock rate of 33 MHz, 64 MB of main memory, and running AIX 3.2. The data resided in the AIX file system and was stored on a 2GB SCSI 3.5" drive, with measured sequential throughput of about 2 MB/second.

We first give an overview of the AIS [4] and SETM [13] algorithms against which we compare the performance of the Apriori and AprioriTid algorithms. We then describe the synthetic datasets used in the performance evaluation and show the performance results. Finally, we describe how the best performance features of Apriori and AprioriTid can be combined into an AprioriHybrid algorithm and demonstrate its scale-up properties.

3.1 The AIS Algorithm

Candidate itemsets are generated and counted on-the-fly as the database is scanned. After reading a transaction, it is determined which of the itemsets that were found to be large in the previous pass are contained in this transaction. New candidate itemsets are generated by extending these large itemsets with other items in the transaction. A large itemset l is extended with only those items that are large and occur later in the lexicographic ordering of items than any of the items in l. The candidates generated from a transaction are added to the set of candidate itemsets maintained for the pass, or the counts of the corresponding entries are increased if they were created by an earlier transaction. See [4] for further details of the AIS algorithm.

3.2 The SETM Algorithm

The SETM algorithm [13] was motivated by the desire to use SQL to compute large itemsets. Like AIS, the SETM algorithm also generates candidates on-the-fly based on transactions read from the database. It thus generates and counts every candidate itemset that the AIS algorithm generates. However, to use the standard SQL join operation for candidate generation, SETM separates candidate generation from counting. It saves a copy of the candidate itemset together with the TID of the generating transaction in a sequential structure. At the end of the pass, the support count of candidate itemsets is determined by sorting and aggregating this sequential structure.

SETM remembers the TIDs of the generating transactions with the candidate itemsets. To avoid needing a subset operation, it uses this information to determine the large itemsets contained in the transaction read. L̄k ⊆ C̄k and is obtained by deleting those candidates that do not have minimum support. Assuming that the database is sorted in TID order, SETM can easily find the large itemsets contained in a transaction in the next pass by sorting L̄k on TID. In fact, it needs to visit every member of L̄k only once in the TID order, and the candidate generation can be performed using the relational merge-join operation [13].

The disadvantage of this approach is mainly due to the size of candidate sets C̄k. For each candidate itemset, the candidate set now has as many entries as the number of transactions in which the candidate itemset is present. Moreover, when we are ready to count the support for candidate itemsets at the end of the pass, C̄k is in the wrong order and needs to be sorted on itemsets. After counting and pruning out small candidate itemsets that do not have minimum support, the resulting set L̄k needs another sort on TID before it can be used for generating candidates in the next pass.

3.3 Generation of Synthetic Data

We generated synthetic transactions to evaluate the performance of the algorithms over a large range of data characteristics. These transactions mimic the transactions in the retailing environment. Our model of the "real" world is that people tend to buy sets of items together. Each such set is potentially a maximal large itemset. An example of such a set might be sheets, pillow case, comforter, and ruffles. However, some people may buy only some of the items from such a set. For instance, some people might buy only sheets and pillow case, and some only sheets. A transaction may contain more than one large itemset. For example, a customer might place an order for a dress and jacket when ordering sheets and pillow cases, where the dress and jacket together form another large itemset. Transaction sizes are typically clustered around a mean and a few transactions have many items. Typical sizes of large itemsets are also
clustered around a mean, with a few large itemsets having a large number of items.

To create a dataset, our synthetic data generation program takes the parameters shown in Table 2.

Table 2: Parameters

    |D|  Number of transactions
    |T|  Average size of the transactions
    |I|  Average size of the maximal potentially large itemsets
    |L|  Number of maximal potentially large itemsets
    N    Number of items

We first determine the size of the next transaction. The size is picked from a Poisson distribution with mean equal to |T|. Note that if each item is chosen with the same probability p, and there are N items, the expected number of items in a transaction is given by a binomial distribution with parameters N and p, and is approximated by a Poisson distribution with mean Np.

We then assign items to the transaction. Each transaction is assigned a series of potentially large itemsets. If the large itemset on hand does not fit in the transaction, the itemset is put in the transaction anyway in half the cases, and the itemset is moved to the next transaction the rest of the cases.

Large itemsets are chosen from a set T of such itemsets. The number of itemsets in T is set to |L|. There is an inverse relationship between |L| and the average support for potentially large itemsets. An itemset in T is generated by first picking the size of the itemset from a Poisson distribution with mean equal to |I|. Items in the first itemset are chosen randomly. To model the phenomenon that large itemsets often have common items, some fraction of items in subsequent itemsets are chosen from the previous itemset generated. We use an exponentially distributed random variable with mean equal to the correlation level to decide this fraction for each itemset. The remaining items are picked at random. In the datasets used in the experiments, the correlation level was set to 0.5. We ran some experiments with the correlation level set to 0.25 and 0.75 but did not find much difference in the nature of our performance results.

Each itemset in T has a weight associated with it, which corresponds to the probability that this itemset will be picked. This weight is picked from an exponential distribution with unit mean, and is then normalized so that the sum of the weights for all the itemsets in T is 1. The next itemset to be put in the transaction is chosen from T by tossing an |L|-sided weighted coin, where the weight for a side is the probability of picking the associated itemset.

To model the phenomenon that all the items in a large itemset are not always bought together, we assign each itemset in T a corruption level c. When adding an itemset to a transaction, we keep dropping an item from the itemset as long as a uniformly distributed random number between 0 and 1 is less than c. Thus for an itemset of size l, we will add l items to the transaction 1-c of the time, l-1 items c(1-c) of the time, l-2 items c^2(1-c) of the time, etc. The corruption level for an itemset is fixed and is obtained from a normal distribution with mean 0.5 and variance 0.1.

We generated datasets by setting N = 1000 and |L| = 2000. We chose 3 values for |T|: 5, 10, and 20. We also chose 3 values for |I|: 2, 4, and 6. The number of transactions was set to 100,000 because, as we will see in Section 3.4, SETM could not be run for larger values. However, for our scale-up experiments, we generated datasets with up to 10 million transactions (838MB for T20). Table 3 summarizes the dataset parameter settings. For the same |T| and |D| values, the sizes of the datasets in megabytes were roughly equal for the different values of |I|.

Table 3: Parameter settings

    Name          |T|  |I|  |D|    Size in Megabytes
    T5.I2.D100K    5    2   100K   2.4
    T10.I2.D100K  10    2   100K   4.4
    T10.I4.D100K  10    4   100K
    T20.I2.D100K  20    2   100K   8.4
    T20.I4.D100K  20    4   100K
    T20.I6.D100K  20    6   100K

3.4 Relative Performance

Figure 4 shows the execution times for the six synthetic datasets given in Table 3 for decreasing values of minimum support. As the minimum support decreases, the execution times of all the algorithms increase because of increases in the total number of candidate and large itemsets.

For SETM, we have only plotted the execution times for the dataset T5.I2.D100K in Figure 4. The execution times for SETM for the two datasets with an average transaction size of 10 are given in Table 4. We did not plot the execution times in Table 4 on the corresponding graphs because they are too large compared to the execution times of the other algorithms. For the three datasets with transaction sizes of 20, SETM took too long to execute and we aborted those runs as the trends were clear. Clearly, Apriori beats SETM by more than an order of magnitude for large datasets.
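The generation procedure of Section 3.3 can be sketched in standard-library Python. This is an illustrative reconstruction, not the authors' program: the function names, the Knuth-style Poisson sampler, and the handling of boundary cases (clipping the corruption level to [0, 1], capping the shared-items fraction at 1) are our own assumptions.

```python
import math
import random

def poisson(mean, rng):
    # Knuth's multiplicative sampler; adequate for the small means used here.
    L = math.exp(-mean)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def make_patterns(n_items, n_patterns, avg_len, corr, rng):
    """Build the pool T of potentially large itemsets, plus per-itemset
    pick weights (exponential, unit mean, normalized to sum to 1) and
    corruption levels (normal, mean 0.5, variance 0.1, clipped to [0, 1])."""
    patterns, levels, weights, prev = [], [], [], []
    for _ in range(n_patterns):
        size = max(1, poisson(avg_len, rng))
        # Fraction of items shared with the previous itemset:
        # exponentially distributed with mean equal to the correlation level.
        frac = min(1.0, rng.expovariate(1.0 / corr))
        n_common = min(len(prev), int(round(frac * size)))
        items = set(rng.sample(prev, n_common)) if n_common else set()
        while len(items) < size:               # remaining items at random
            items.add(rng.randrange(n_items))
        prev = sorted(items)
        patterns.append(prev)
        levels.append(min(1.0, max(0.0, rng.gauss(0.5, math.sqrt(0.1)))))
        weights.append(rng.expovariate(1.0))
    total = sum(weights)
    return patterns, levels, [w / total for w in weights]

def make_transactions(n_trans, avg_tlen, patterns, levels, weights, rng):
    """Yield transactions of Poisson(|T|) size, filled with corrupted
    potentially large itemsets chosen by an |L|-sided weighted coin."""
    sides = range(len(patterns))
    carried = None                 # itemset deferred to the next transaction
    for _ in range(n_trans):
        size, t = max(1, poisson(avg_tlen, rng)), set()
        while len(t) < size:
            i = carried if carried is not None else rng.choices(sides, weights)[0]
            carried = None
            items = list(patterns[i])
            # Corruption: keep dropping an item while U(0,1) < c.
            while items and rng.random() < levels[i]:
                items.pop(rng.randrange(len(items)))
            if t and len(t) + len(items) > size:
                if rng.random() < 0.5:
                    t |= set(items)    # put it in anyway (half the cases)
                else:
                    carried = i        # move it to the next transaction
                break
            t |= set(items)
        yield sorted(t)
```

A call such as `pats, lvls, wts = make_patterns(1000, 2000, 4, 0.5, rng)` followed by `make_transactions(100_000, 10, pats, lvls, wts, rng)` would mirror the T10.I4.D100K setting.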
[Figure 4: Execution times. Six panels, one per dataset (T5.I2.D100K, T10.I2.D100K, T10.I4.D100K, T20.I2.D100K, T20.I4.D100K, T20.I6.D100K), plot time in seconds against minimum support from 2% down to 0.25% for SETM (T5.I2.D100K panel only), AIS, AprioriTid, and Apriori.]
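SETM's troubles in these graphs trace back to the size of its C̄k sets, which is the sum of the candidates' support counts; the same quantity, plus one TID entry per transaction, is what AprioriHybrid later estimates when deciding whether C̄k will fit in memory (Section 3.6). A minimal sketch, with hypothetical names and the word-count memory model taken from the text:

```python
def cbar_size_in_words(support_counts, num_transactions):
    """Estimated size of C-bar-k, in words: one ID word per
    (transaction, candidate) containment, plus one TID word per
    transaction entry -- the estimate AprioriHybrid uses (Section 3.6)."""
    return sum(support_counts.values()) + num_transactions

def should_switch_to_aprioritid(support_counts, num_transactions,
                                memory_words, num_large_now, num_large_prev):
    """AprioriHybrid's heuristic: switch when the estimated C-bar-k fits
    in memory AND the number of large itemsets is declining (so C-bar-k
    in the next pass is unlikely to grow back out of memory)."""
    fits = cbar_size_in_words(support_counts, num_transactions) <= memory_words
    return fits and num_large_now < num_large_prev

# Toy support counts for three 2-candidates over 100,000 transactions:
counts = {("A", "B"): 3000, ("A", "C"): 2500, ("B", "C"): 4500}
est = cbar_size_in_words(counts, num_transactions=100_000)
# 10,000 containment words + 100,000 TID words, versus only 3 entries in
# Ck itself: the average support S is the factor that inflates C-bar-k.
```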
Table 4: Execution times in seconds for SETM

    Algorithm            Minimum Support
                2.0%   1.5%   1.0%   0.75%   0.5%
    Dataset T10.I2.D100K
    SETM          74    161    838    1262   1878
    Apriori      4.4    5.3   11.0    14.5   15.3
    Dataset T10.I4.D100K
    SETM          41     91    659     929   1639
    Apriori      3.8    4.8   11.2    17.4   19.3

Apriori beats AIS for all problem sizes, by factors ranging from 2 for high minimum support to more than an order of magnitude for low levels of support. AIS always did considerably better than SETM. For small problems, AprioriTid did about as well as Apriori, but performance degraded to about twice as slow for large problems.

3.5 Explanation of the Relative Performance

To explain these performance trends, we show in Figure 5 the sizes of the large and candidate sets in different passes for the T10.I4.D100K dataset for the minimum support of 0.75%. Note that the Y-axis in this graph has a log scale.

[Figure 5: Sizes of the large and candidate sets (T10.I4.D100K, minsup = 0.75%). Number of itemsets on a log scale (1 to 1e+07) against pass number (1 to 7) for C̄k (SETM), C̄k (AprioriTid), Ck (AIS, SETM), Ck (Apriori, AprioriTid), and Lk.]

The fundamental problem with the SETM algorithm is the size of its C̄k sets. Recall that the size of the set C̄k is given by

    Σ over candidate itemsets c of support-count(c).

Thus, the sets C̄k are roughly S times bigger than the corresponding Ck sets, where S is the average support count of the candidate itemsets. Unless the problem size is very small, the C̄k sets have to be written to disk, and externally sorted twice, causing the SETM algorithm to perform poorly. [Footnote 2: The cost of external sorting in SETM can be reduced somewhat as follows. Before writing out entries in C̄k to disk, we can sort them on itemsets using an internal sorting procedure, and write them as sorted runs. These sorted runs can then be merged to obtain support counts. However, given the poor performance of SETM, we do not expect this optimization to affect the algorithm choice.] This explains the jump in time for SETM in Table 4 when going from 1.5% support to 1.0% support for datasets with transaction size 10. The largest dataset in the scale-up experiments for SETM in [13] was still small enough that C̄k could fit in memory; hence they did not encounter this jump in execution time. Note that for the same minimum support, the support count for candidate itemsets increases linearly with the number of transactions. Thus, as we increase the number of transactions for the same values of |T| and |I|, though the size of Ck does not change, the size of C̄k goes up linearly. Thus, for datasets with more transactions, the performance gap between SETM and the other algorithms will become even larger.

The problem with AIS is that it generates too many candidates that later turn out to be small, causing it to waste too much effort. Apriori also counts too many small sets in the second pass (recall that C2 is really a cross-product of L1 with L1). However, this wastage decreases dramatically from the third pass onward. Note that for the example in Figure 5, after pass 3, almost every candidate itemset counted by Apriori turns out to be a large set.

AprioriTid also has the problem of SETM that C̄k tends to be large. However, the apriori candidate generation used by AprioriTid generates significantly fewer candidates than the transaction-based candidate generation used by SETM. As a result, the C̄k of AprioriTid has fewer entries than that of SETM. AprioriTid is also able to use a single word (ID) to store a candidate rather than requiring as many words as the number of items in the candidate. [Footnote 3: For SETM to use IDs, it would have to maintain two additional in-memory data structures: a hash table to find out whether a candidate has been generated previously, and a mapping from the IDs to candidates. However, this would destroy the set-oriented nature of the algorithm. Also, once we have the hash table which gives us the IDs of candidates, we might as well count them at the same time and avoid the two external sorts. We experimented with this variant of SETM and found that, while it did better than SETM, it still performed much worse than Apriori or AprioriTid.] In addition, unlike SETM, AprioriTid does not have to sort C̄k. Thus, AprioriTid does not suffer as much as SETM from maintaining C̄k.

AprioriTid has the nice feature that it replaces a pass over the original dataset by a pass over the set C̄k. Hence, AprioriTid is very effective in later passes when the size of C̄k becomes small compared to the
size of the database. Thus, we find that AprioriTid beats Apriori when its C̄k sets can fit in memory and the distribution of the large itemsets has a long tail. When C̄k doesn't fit in memory, there is a jump in the execution time for AprioriTid, such as when going from 0.75% to 0.5% for datasets with transaction size 10 in Figure 4. In this region, Apriori starts beating AprioriTid.

3.6 Algorithm AprioriHybrid

It is not necessary to use the same algorithm in all the passes over data. Figure 6 shows the execution times for Apriori and AprioriTid for different passes over the dataset T10.I4.D100K. In the earlier passes, Apriori does better than AprioriTid. However, AprioriTid beats Apriori in later passes. We observed similar relative behavior for the other datasets, the reason for which is as follows. Apriori and AprioriTid use the same candidate generation procedure and therefore count the same itemsets. In the later passes, the number of candidate itemsets reduces (see the size of Ck for Apriori and AprioriTid in Figure 5). However, Apriori still examines every transaction in the database. On the other hand, rather than scanning the database, AprioriTid scans C̄k for obtaining support counts, and the size of C̄k has become smaller than the size of the database. When the C̄k sets can fit in memory, we do not even incur the cost of writing them to disk.

[Figure 6: Per pass execution times of Apriori and AprioriTid (T10.I4.D100K, minsup = 0.75%). Time in seconds against pass number (1 to 7).]

Based on these observations, we can design a hybrid algorithm, which we call AprioriHybrid, that uses Apriori in the initial passes and switches to AprioriTid when it expects that the set C̄k at the end of the pass will fit in memory. We use the following heuristic to estimate if C̄k would fit in memory in the next pass. At the end of the current pass, we have the counts of the candidates in Ck. From this, we estimate what the size of C̄k would have been if it had been generated. This size, in words, is (Σ over candidates c in Ck of support(c), plus the number of transactions). If C̄k in this pass was small enough to fit in memory, and there were fewer large candidates in the current pass than the previous pass, we switch to AprioriTid. The latter condition is added to avoid switching when C̄k in the current pass fits in memory but C̄k in the next pass may not.

Switching from Apriori to AprioriTid does involve a cost. Assume that we decide to switch from Apriori to AprioriTid at the end of the kth pass. In the (k+1)th pass, after finding the candidate itemsets contained in a transaction, we will also have to add their IDs to C̄k+1 (see the description of AprioriTid in Section 2.2). Thus there is an extra cost incurred in this pass relative to just running Apriori. It is only in the (k+2)th pass that we actually start running AprioriTid. Thus, if there are no large (k+1)-itemsets, or no (k+2)-candidates, we will incur the cost of switching without getting any of the savings of using AprioriTid.

Figure 7 shows the performance of AprioriHybrid relative to Apriori and AprioriTid for three datasets. AprioriHybrid performs better than Apriori in almost all cases. For T10.I2.D100K with 1.5% support, AprioriHybrid does a little worse than Apriori since the pass in which the switch occurred was the last pass; AprioriHybrid thus incurred the cost of switching without realizing the benefits. In general, the advantage of AprioriHybrid over Apriori depends on how the size of the C̄k set declines in the later passes. If C̄k remains large until nearly the end and then has an abrupt drop, we will not gain much by using AprioriHybrid since we can use AprioriTid only for a short period of time after the switch. This is what happened with the T20.I6.D100K dataset. On the other hand, if there is a gradual decline in the size of C̄k, AprioriTid can be used for a while after the switch, and a significant improvement can be obtained in the execution time.

3.7 Scale-up Experiment

Figure 8 shows how AprioriHybrid scales up as the number of transactions is increased from 100,000 to 10 million transactions. We used the combinations (T5.I2), (T10.I4), and (T20.I6) for the average sizes of transactions and itemsets respectively. All other parameters were the same as for the data in Table 3. The sizes of these datasets for 10 million transactions were 239MB, 439MB and 838MB respectively. The minimum support level was set to 0.75%. The execution times are normalized with respect to the times for the 100,000 transaction datasets in the first
graph and with respect to the 1 million transaction dataset in the second. As shown, the execution times scale quite linearly.

[Figure 7: Execution times of AprioriTid, Apriori, and AprioriHybrid against minimum support (2% down to 0.25%) for the datasets T10.I2.D100K, T10.I4.D100K, and T20.I6.D100K.]

[Figure 8: Number of transactions scale-up. Relative execution time of AprioriHybrid for T20.I6, T10.I4, and T5.I2 as the number of transactions grows from 100,000 to 1 million (first graph) and from 1 million to 10 million (second graph).]

Next, we examined how AprioriHybrid scaled up with the number of items. We increased the number of items from 1000 to 10,000 for the three parameter settings T5.I2.D100K, T10.I4.D100K and T20.I6.D100K. All other parameters were the same as for the data in Table 3. We ran experiments for a minimum support at 0.75%, and obtained the results shown in Figure 9. The execution times decreased a little since the average support for an item decreased as we increased the number of items. This resulted in fewer large itemsets and, hence, faster execution times.

Finally, we investigated the scale-up as we increased the average transaction size. The aim of this experiment was to see how our data structures scaled with the transaction size, independent of other factors like the physical database size and the number of large itemsets. We kept the physical size of the
[Figure 9: Number of items scale-up. Execution time in seconds for T20.I6, T10.I4, and T5.I2 as the number of items grows from 1000 to 10,000.]

[Figure 10: Transaction size scale-up. Execution time in seconds for minimum support levels of 500, 750, and 1000 transactions as the average transaction size grows from 5 to 50.]

database roughly constant by keeping the product of the average transaction size and the number of transactions constant. The number of transactions ranged from 200,000 for the database with an average transaction size of 5 to 20,000 for the database with an average transaction size of 50. Fixing the minimum support as a percentage would have led to large increases in the number of large itemsets as the transaction size increased, since the probability of an itemset being present in a transaction is roughly proportional to the transaction size. We therefore fixed the minimum support level in terms of the number of transactions. The results are shown in Figure 10. The numbers in the key (e.g. 500) refer to this minimum support. As shown, the execution times increase with the transaction size, but only gradually. The main reason for the increase was that, in spite of setting the minimum support in terms of the number of transactions, the number of large itemsets increased with increasing transaction length. A secondary reason was that finding the candidates present in a transaction took a little longer time.

4 Conclusions and Future Work

We presented two new algorithms, Apriori and AprioriTid, for discovering all significant association rules between items in a large database of transactions. We compared these algorithms to the previously known algorithms, the AIS [4] and SETM [13] algorithms. We presented experimental results showing that the proposed algorithms always outperform AIS and SETM. The performance gap increased with the problem size, and ranged from a factor of three for small problems to more than an order of magnitude for large problems.

We showed how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid, which then becomes the algorithm of choice for this problem. Scale-up experiments showed that AprioriHybrid scales linearly with the number of transactions. In addition, the execution time decreases a little as the number of items in the database increases. As the average transaction size increases (while keeping the database size constant), the execution time increases only gradually. These experiments demonstrate the feasibility of using AprioriHybrid in real applications involving very large databases.

The algorithms presented in this paper have been implemented on several data repositories, including the AIX file system, DB2/MVS, and DB2/6000. We have also tested these algorithms against real customer data, the details of which can be found in [5]. In the future, we plan to extend this work along the following dimensions:

- Multiple taxonomies (is-a hierarchies) over items are often available. An example of such a hierarchy is that a dish washer is a kitchen appliance is a heavy electric appliance, etc. We would like to be able to find association rules that use such hierarchies.

- We did not consider the quantities of the items bought in a transaction, which are useful for some applications. Finding such rules needs further work.

The work reported in this paper has been done in the context of the Quest project at the IBM Almaden Research Center. In Quest, we are exploring the various aspects of the database mining problem. Besides the problem of discovering association rules, some other problems that we have looked into include
the enhancement of the database capability with classification queries [2] and similarity queries over time sequences [1]. We believe that database mining is an important new application area for databases, combining commercial interest with intriguing research questions.

Acknowledgment  We wish to thank Mike Carey for his insightful comments and suggestions.

References

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proc. of the Fourth International Conference on Foundations of Data Organization and Algorithms, Chicago, October 1993.

[2] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An interval classifier for database mining applications. In Proc. of the VLDB Conference, pages 560-573, Vancouver, British Columbia, Canada, 1992.

[3] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914-925, December 1993. Special Issue on Learning and Discovery in Knowledge-Based Databases.

[4] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, Washington, D.C., May 1993.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.

[6] D. S. Associates. The New Direct Marketing. Business One Irwin, Illinois, 1990.

[7] R. Brachman et al. Integrated support for data archeology. In AAAI-93 Workshop on Knowledge Discovery in Databases, July 1993.

[8] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.

[9] P. Cheeseman et al. Autoclass: A Bayesian classification system. In 5th Int'l Conf. on Machine Learning. Morgan Kaufman, June 1988.

[10] D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2), 1987.

[11] J. Han, Y. Cai, and N. Cercone. Knowledge discovery in databases: An attribute-oriented approach. In Proc. of the VLDB Conference, pages 547-559, Vancouver, British Columbia, Canada, 1992.

[12] M. Holsheimer and A. Siebes. Data mining: The search for knowledge in databases. Technical Report CS-R9406, CWI, Netherlands, 1994.

[13] M. Houtsma and A. Swami. Set-oriented mining of association rules. Research Report RJ 9567, IBM Almaden Research Center, San Jose, California, October 1993.

[14] R. Krishnamurthy and T. Imielinski. Practitioner problems in need of database research: Research directions in knowledge discovery. SIGMOD RECORD, 20(3):76-78, September 1991.

[15] P. Langley, H. Simon, G. Bradshaw, and J. Zytkow. Scientific Discovery: Computational Explorations of the Creative Process. MIT Press, 1987.

[16] H. Mannila and K.-J. Raiha. Dependency inference. In Proc. of the VLDB Conference, pages 155-158, Brighton, England, 1987.

[17] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In KDD-94: AAAI Workshop on Knowledge Discovery in Databases, July 1994.

[18] S. Muggleton and C. Feng. Efficient induction of logic programs. In S. Muggleton, editor, Inductive Logic Programming. Academic Press, 1992.

[19] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1992.

[20] G. Piatestsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatestsky-Shapiro, editor, Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

[21] G. Piatestsky-Shapiro, editor. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

[22] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufman, 1993.