1. T. Sowmiya, V. Thangamani / International Journal of Engineering Research and Applications
(IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 2, Issue 6, November- December 2012, pp.354-358
Generating Similar Item Sets Of Temporal Databases Using
Spamine Algorithm
T. Sowmiya1, V. Thangamani2
1
pg Scholar, Dept Of Cse, Vel Tech Dr.Rr & Dr.Sr Technical University, Chennai-62.
2
pg Scholar, Dept Of It, Anna University, Guindy, Chennai-25.
ABSTRACT
Data mining is the process of extracting ecology, and biology.
interesting like non-trivial, implicit, previously 2. RELATED WORKS:
unknown and potentially useful information or Many related works are applicable to
patterns from large information repositories such this paper that works includes the induction
as: relational database, data warehouses, XML by Bremen et al. and quinoa “classification
repository, etc. Data mining is known as one of rules” in 1984 and 1993, Spiegel halter et al.
the core processes of Knowledge Discovery in at “discovery of causal rules” in 1993,
Database (KDD). Association rule mining is a Muggleton and Feng at “learning of logical
popular and well researched method for definitions” in 1992, Langley et al. at “fitting
discovering interesting relationships among of functions to data” in 1987, Cheeseman et
various data items in large databases. The al. at “clustering” in 1988. The goal of
existing system uses only frequent item sets for association rule discovery is to find associations
association rule mining. It may easily cause among items from a set of transactions, each of
thrashing when dataset becomes large and which contains a set of items.
sparse. To overcome this spamine algorithm is
used here for computation of support values of all Frequent item set generator:
possible item sets at each time point and Mohammed Saki discovers Charm frequent
generates their support sequences and compares item set generator generates the closed frequent item
the generated support time sequences with a sets and not the association rules on February 25,
given reference sequence and finds similar item 2001. Jawed Han’s research group at Simon Fraser
sets. University discovers FP-growth generates frequent
Item sets for association rules from It generates all
Keywords: Association rule mining, Support frequent item sets satisfying a given minimum
values, Similar item set, Candidate item sets. support by growing a frequent pattern tree structure
that stores compressed information about the
1. INTRODUCTION: frequent-growth can avoid repeated database scans
Generating the support time sequences by and also avoids the generation of a large number of
reducing the candidate item sets, using the Candidate item sets. Jawed Han and Jean Pei
Association rules discover interrelation ships discovers FP-growth implementation in February 5,
among various data items in transactional data. 2001, it provides the improvement when compared
Similarity-profiled temporal association mining is to the earlier version of experimental results. Jawed
to discover all associated item sets whose Han’s finds Closet; it is a frequent item set generator
prevalence variations over time are similar to for association rules from research group on
the reference sequence under a threshold. September 21, 2000 .Jawed Han and Jean Pei
Similarity-profiled temporal association mining provided the closet implementation.
can reveal interesting relationships of data These implementations of FP-growth and
items that co occur with a particular event Closet only generate the frequent item sets, and not
over time. Using the, time stamped transaction the association rules. Geoff Webb implements the
database and a user defined reference sequence Magnum Opus on February 1, 2001. Magnum Opus
of interest over time. One major area of data directly generates association rules from a dataset
mining from these data is association pattern based on a specified search preference. Magnum
analysis. The discovery of association rules Opus is a unique technique for the search algorithm
paid attention to temporal information, which is based on an efficient admissible algorithm for
implicitly related to transaction data, e.g., the unordered search.
time that a transaction is executed, and
discovered association patterns that vary over Discovering calendar based temporal association
time. Association patterns can give important rules:
insight into many Application domains such A temporal association rule is an
as business, agriculture, earth science, association rule that holds during specific time
354 | P a g e
2. T. Sowmiya, V. Thangamani / International Journal of Engineering Research and Applications
(IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 2, Issue 6, November- December 2012, pp.354-358
intervals. S.Ramaswamy, S.mahajan, and 4. MODULE DESCRIPTION:
A.Silberschatz, “on the discovery of interesting 4.1 Data Preprocessing
patterns in association rules”. B.Ozde, The temporal dataset given by the user can
S.Ramaswamy, andA.Silberschatz at “cyclic have any type of item sets. The proposed system is
association rules” was extended to approximately modeled such that the association rule mining
discover user-defined temporal patternsin algorithm could process only numeric item sets. So
association rules.the work in “on the discovery of the dataset given by the user is converted into
interesting patterns in association rules” is more numeric format and then the numeric data is split
flexible and practical than “cyclic association rules” into different time stamps to which the SPAMINE
however it requires user-defined calendar algebraic algorithm can be applied.
expressions in order to discover temporal patterns.
Y. Li, P. Ning, X.S. Wang, and S. Jajodia at 4.2 Calculation of Support Time Sequences
“Discovering calendar based temporal association Support time sequences can be constructed
rules” J.Data and Knowledge Engg., Vol.15, no.2, by two different ways in scanning the time stamped
2003 uses calendar schema for better result. As a transaction data set.
result this implements less priori then “on the 1. Lattice-dominant scan
discovery of interesting patterns in association rules” 2. Snapshot-dominant scan
and “cyclic association rules”. S.Ramaswamy, Lattice-dominant scan: The lattice-
S.mahajan, and A.Silberschatz, at “on the discovery dominant scan method reads a whole transaction
of interesting patterns in association rules” which data set from time slot t1 to time slot ten for candidate
discover temporal association rules for one user item sets of each size and generates their support
defined temporal pattern Y. Li, P. Ning, X.S. time sequences.
Wang, and S. Jajodia at “Discovering calendar
based temporal association rules” J.Data and Snapshot-dominant scan: The snapshot-
Knowledge Engg., Vol.15,no.2,2003 shows all dominant scan method repeats the scanning of
possible temporal patterns in calendar schema. It can transactions at each time slot. First, it counts the
potentially discover more temporal association rules. supports of all candidate item sets of different sizes
in the first time slot, and then it moves to the next
3. METHODS: time slot and repeats the process. This method
Association rule mining is a popular and incrementally generates the support time sequences
well researched method for discovering interesting with the processed time slots.
relationships among various data items in large SPAMINE algorithm uses Lattice-dominant scan
databases. The existing apriori based temporal to read the whole Transaction dataset from time slot
association rule mining. Find all frequent item sets t1 to time slot ten for candidate item sets and generate
which occur at least as frequent as a pre-determined their support time sequences. The similarity measure
minimum support count and Generate strong used in this algorithm is Euclidean distance.
association rules from the frequent item sets which Euclidean distance between two points
satisfy minimum confidence. The problem of mining (x1,y1) and (x2,y2) is calculated using the
the temporal pattern is formulated with a similarity- D= [(x2-x1)2+ (y2-y1)2]1/2
profiled subset specification, which consists of a Similarity measure (Euclidean distance) is
reference time sequence. applied to each candidate item set to calculate the
To overcome this problem this paper adopts following:
the Similarity-Profiled temporal Association 1. Upper and lower bound sequence
Mining method (SPAMINE)[1] algorithm divides 2. Upper lower-bounding distance
the mining process into two separate phases. Generating the support time sequences of
• The first phase computes the support values of all item sets is the core operation in similarity-profiled
possible item sets at each time point and association mining algorithm. The operation,
generates their support sequences. however, is very data intensive and sometimes can
• The second phase compares the generated produce the sequences of all combinations of items.
support time sequences with a given reference A different way is explored to estimate support time
sequence and finds similar item sets. sequences without examining an input data set.
A SPAMINE algorithm is developed Lower bounding distances are explored to
here for Temporal patterns are searched with a user find item sets whose support sequences could not
defined numeric reference sequence and consider the possibly be a best match with a reference sequence
prevalence similarities of all possible item sets, not under a dissimilarity threshold. In this concept of
only frequent item sets. The SPAMINE algorithm lower bounding distance, if the lower bounding
will reduce the search space. The computation cost distance of an item set does not satisfy the
is reduced. Both transaction and time are taken dissimilarity threshold, its true distance also does not
under consideration. satisfy the threshold.
355 | P a g e
3. T. Sowmiya, V. Thangamani / International Journal of Engineering Research and Applications
(IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 2, Issue 6, November- December 2012, pp.354-358
Thus, the lower bounding distance can be un =min{support (J1, DNS), …., support
used to prune item sets early without the (KJ, DNS)}
computation of the true distances. Lower bounding Lower bound sequence:
distance is defined with upper and lower bound Lower bound support time sequence of item
support time sequences. set I, L1 =< l1 , …, in > is defined by
Support time sequences: l1 =max{(support (J1, D1)+support (I- J1,
Let D=D1U…Dun be a set of disjoint transactions. D1)-1), …, (support(Joke, D1)+support (I-
J={J1, …, Joke} be a set of all size k-1 subsets of a Joke, D1)-1),0 }
size k item set I. in=max{(support (J1, Den)+support(I- J1,
Upper bound sequence: Den)-1), …, (support (Joke, Den)+support(I-
Upper bound support time sequence of item Joke, Den)-1),0 }
set I, U1 =< u1 , …, un> is calculated using The lower lower-bounding distance between R and
u1=min{support (J1, D1), …., support (KJ, L, Dib (R, L) is defined to
D1)}
Lower Bounding Distance D(RU , LL)
The upper lower-bounding distance The lower-bounding distance,
between Rand U, Dub(R, U) is defined to Dab(R, U, L) = Dub (R, U)+
D(RU , UL) Dib(R, L)
4.3 Generation of Similar Item Sets such as upper bound support time sequence, lower
The algorithm compares the calculated distances bound support time sequence and lower bounding
with the dissimilarity threshold in order to prune distance along with the user defined reference
the dissimilar item sets and results in similar item sequence.
sets. The algorithm uses all the calculated values
SYSTEM FLOW DIAGRAM:
Read Datasets and Threshold Value
Divide into Time Stamps
Convert Data into Numeric Values
Find Item Sets
Find Support Values
Calculate Upper Bound and Lower Bound
Apply Upper Distance Formula to Find the
Similar Item Sets
Compare With Minimum Threshold
Output the Similar Item Sets That Satisfy the
Threshold
FIGURE 1: FINDING SIMILAR ITEM SETS
356 | P a g e
4. T. Sowmiya, V. Thangamani / International Journal of Engineering Research and Applications
(IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 2, Issue 6, November- December 2012, pp.354-358
here is depends on data distribution,
this type of reference sequence, dissimilarity
threshold.
The future improvement may be of
ESULT: different similarity models for temporal patterns can
Experimental results on fake and real data sets be explored. The current similarity model using a Lp
showed that the SPAMINE algorithm used here norm-based similarity function is a little rigid in
is computationally efficient and can produce finding similar temporal patterns. It may be
meaningful results from real data. The interesting to consider not only a relaxed similarity
similarity-profiled temporal association mining model to catch temporal patterns that show similar
method algorithm uses the lower bounding trends but also phase shifts in time. For example, the
distance of the bounds of support sequences, sale of items for cleanup such as chain saws and
and the monotonicity property of the upper mops would increase after a storm rather than
lower bounding distance without during the storm. The current project considered
compromising the correctness and completeness whole sequence matching for the similar temporal
of the mining results. For the significant patterns. Subsequence matching models may be
reduction of search space by pruning candidate more flexible for the patterns.
item sets However, the pruning scheme used
FIGURE 2: FABRICATION OF SIMILAR ITEM SETS
CONCLUSION: distance without compromising the correctness and
The similarity-profiled temporal completeness of the mining results. The mining
association mining method (SPAMINE)[1] algorithm problem similarity-profiled temporal association
substantially reduced the search space by pruning patterns are formulated and an algorithm is proposed
candidate item sets using the lower bounding to discover them. Experimental results on data sets
distance of the bounds of support sequences, and the showed that the SPAMINE algorithm is
monotonicity property of the upper lower-bounding
357 | P a g e
5. T. Sowmiya, V. Thangamani / International Journal of Engineering Research and Applications
(IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 2, Issue 6, November- December 2012, pp.354-358
computationally efficient and can produce meaningful results from real data.
J.Lin, “Discovering Frequent Event
Patterns with Multiple Granularities in
Time Sequences,” IEEE Trans.
Knowledge and Data Eng., vol. 10, no. 2,
REFERENCES: Mar.-Apr. 1998.
1. Jin Soung Yoo and Shashi Shekhar, 7. G.Dong and J.Li, “Efficient Mining of
“Similarity-Profiled Temporal Association Emerging Patterns: Discovering Trends
Mining”, IEEE 2009. and Differences,” Proc. ACM SIGKDD,
2. Juan M.Ale and Gustavo H.Rossi R,”An 1999.
approach to discovering temporal 8. B. Liu, W.Hsu, and Y.Ma,
association rules”, ACM SIGDD 2002. “Discovering the Set of Fundamental
3. Keshri Verma and O.P.Vyas, ”Efficient Rule Change,” Proc. ACM SIGKDD,
calendar based temporal association rule”, 2001.
SIGMOD Record, Vol.34, No.3, 2005. 9. W.Teng, M.Chen, and P.Yu, “A
4. R. Agarwal and R. Srikant, “Fast Regression-Based Temporal Pattern
Algorithms for Mining Association Mining Scheme for Data Streams,”
Rules,” Proc. Int’l Conf. Very Large Proc. Int’l Conf. Very Large Databases
Databases (VLDB), 1994. (VLDB), 2003.
5. R.Srikant and R.Agrawal, “Mining 10. Y. Li, P.Ning, X.S. Wang, and
Generalized Association Rules,” Proc. S.Jajodia, “Discovering Calendar Based
Int’l Conf. Very Large Databases (VLDB), Temporal Association Rules,” J. Data and
1995. Knowledge Eng., vol. 15, no. 2, 2003.
6. C.Bettini, X.Wang, S.Jajodia, and
358 | P a g e