2. Introduction to sequence mining
Why sequence mining?
Sequence mining algorithms
SPADE
Motivation
Definitions and examples
Algorithm
Implementation
Data Mining 11/8/2011 2
3. Aim - finding statistically relevant patterns
between data examples where the values are
delivered in a sequence
Originally introduced for market basket
analysis - customer behaviour predictions
2 types of sequence mining:
string mining – biology (gene/protein sequences)
itemset mining - marketing and CRM applications
4. Discovering patterns:
Bookstore: 70% of the people who buy Jane
Austen’s “Pride and Prejudice” also buy “Emma”
within a month
Website: finding sequences of most frequently
accessed pages
Usage:
Promotions
Shelf placement
Restructure the website
Recommender systems
6. Problems of existing solutions
Repeated database scans
Complex internal data structures
Key features of SPADE:
Fixed number of database scans
Vertical id-list database format
Decomposition of search space into smaller
pieces – processed independently
7. Itemset: set of m distinct items
I = {i1, i2, …, im }
Event: non-empty collection of items
(i1,i2 … ik)
Sequence : ordered list of events
< e1 -> e2 -> … -> en >
k-sequence: sequence with k items
(B->AC) is a 3-sequence (3 items in 2 events)
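The definitions above can be sketched in Python. This is a minimal illustration, assuming a sequence is stored as a tuple of frozensets (one frozenset per event); the helper names `make_sequence`, `sequence_length` and `show` are hypothetical, not part of SPADE.

```python
def make_sequence(*events):
    """Build a sequence: an ordered tuple of non-empty events (frozensets of items)."""
    seq = tuple(frozenset(event) for event in events)
    if any(len(event) == 0 for event in seq):
        raise ValueError("every event must contain at least one item")
    return seq


def sequence_length(seq):
    """The k of a k-sequence: total number of items over all events."""
    return sum(len(event) for event in seq)


def show(seq):
    """Render a sequence in the slide notation, e.g. B->AC."""
    return "->".join("".join(sorted(event)) for event in seq)


b_ac = make_sequence("B", "AC")
print(show(b_ac), "is a", f"{sequence_length(b_ac)}-sequence")
```

Counting items rather than events is what makes (B->AC) a 3-sequence even though it has only two events.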
8. Subsequence: given two sequences α = <a1 a2 … an>
and β = <b1 b2 … bm>, α is called a subsequence of
β, denoted α ⊆ β, if there exist integers
1 ≤ j1 < j2 < … < jn ≤ m such that
a1 ⊆ b_j1, a2 ⊆ b_j2, …, an ⊆ b_jn
Examples:
1. (B->AC) is a subsequence of (AB->E->ACD)
2. (AB->E) is not a subsequence of (ABE)
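The subsequence test can be checked with a short Python sketch. It assumes events are frozensets and uses a greedy left-to-right match (each event of α is matched to the earliest remaining event of β that contains it); the helpers `seq` and `is_subsequence` are illustrative names.

```python
def seq(text):
    """Parse slide notation such as 'AB->E->ACD' into a list of events."""
    return [frozenset(event) for event in text.split("->")]


def is_subsequence(alpha, beta):
    """True if the events of alpha map, in order, into distinct events of beta."""
    j = 0
    for a in alpha:
        # Advance to the first remaining event of beta that contains a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next event of alpha must match a strictly later event
    return True


print(is_subsequence(seq("B->AC"), seq("AB->E->ACD")))  # True
print(is_subsequence(seq("AB->E"), seq("ABE")))         # False
```

The second example fails because AB and E would both have to map to the single event ABE, and the indices j1 < j2 must be strictly increasing.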
12. D->BF->A
Step 3 : D->BF->A
Not space-efficient
Solution: 2 columns - (sid,eid) for each sequence
eid: event id of the sequence’s last item
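A vertical id-list layout might look like the following Python sketch. The toy database `DB` is assumed purely for illustration and is not taken from the slides; `vertical_id_lists` and `support` are hypothetical helper names.

```python
from collections import defaultdict

# Assumed toy database: sid -> list of (eid, items in that event).
DB = {
    1: [(10, "CD"), (15, "ABC"), (20, "ABF"), (25, "ACDF")],
    2: [(15, "ABF"), (20, "E")],
    3: [(10, "ABF")],
    4: [(10, "DGH"), (20, "BF"), (25, "AGH")],
}


def vertical_id_lists(db):
    """Turn the horizontal database into one (sid, eid) list per item."""
    id_lists = defaultdict(list)
    for sid, events in sorted(db.items()):
        for eid, items in events:
            for item in items:
                id_lists[item].append((sid, eid))
    return dict(id_lists)


def support(id_list):
    """Support = number of distinct sequences (sids) in the id-list."""
    return len({sid for sid, _ in id_list})


lists = vertical_id_lists(DB)
print(lists["D"], support(lists["D"]))
```

Storing only two columns per pattern keeps each id-list compact, and support can be read off an id-list without rescanning the database.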
16. Decomposing the lattice => smaller pieces
that can be solved independently
Equivalence classes
Two sequences are in the same class (Θk) if they
share a common prefix of length k
Example
k=1 : Θ1 -> {[A],[B],[D],[F]}
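Prefix-based grouping can be sketched as below. The list of sequences is assumed for illustration, and `parse`, `prefix` and `equivalence_classes` are hypothetical helpers; the prefix is counted in items, keeping event boundaries.

```python
from collections import defaultdict


def parse(text):
    """Parse 'D->BF' into a tuple of events, each a sorted tuple of items."""
    return tuple(tuple(sorted(event)) for event in text.split("->"))


def prefix(seq, k):
    """First k items of seq, preserving which event each item belongs to."""
    out, remaining = [], k
    for event in seq:
        if remaining == 0:
            break
        out.append(event[:remaining])
        remaining -= len(out[-1])
    return tuple(out)


def equivalence_classes(sequences, k):
    """Group sequences that share the same k-length prefix."""
    classes = defaultdict(list)
    for s in sequences:
        classes[prefix(s, k)].append(s)
    return dict(classes)


# Assumed example sequences (not from the slides).
seqs = [parse(t) for t in ["A->B", "AB", "A->F", "B->A", "D->B", "D->BF", "F->A"]]
theta1 = equivalence_classes(seqs, 1)
print(sorted(key[0][0] for key in theta1))  # ['A', 'B', 'D', 'F']
```

Each class can then be mined on its own, which is what allows the lattice to be processed piece by piece.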
19. SPADE(min_sup, D)
    // min_sup: minimum support
    // D: initial dataset
    F1 <- {frequent items or 1-sequences}
    F2 <- {frequent 2-sequences}
    E <- {equivalence classes [X] of Θ1}
    for all [X] in E
        Enumerate_frequent_seq([X], min_sup)
20. Enumerate_frequent_seq(S, min_sup)
    for all Ai in S
        Ti <- {}
        for all Aj in S, with j ≥ i
            R <- Ai v Aj    // join
            if R satisfies min_sup
                Ti <- Ti U {R}
        end
        if DFS: Enumerate_frequent_seq(Ti, min_sup)
    end
    if BFS: for all non-empty Ti
        Enumerate_frequent_seq(Ti, min_sup)
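The join step (Ai v Aj) works directly on id-lists. The Python sketch below is a simplified illustration, not the full set of join cases in SPADE: `temporal_join` extends a pattern with a later event, `event_join` merges two patterns ending in the same event, and the id-lists are assumed toy data.

```python
def temporal_join(x_list, y_list):
    """Id-list of X->Y: keep (sid, eid) of Y where X ends earlier in the same sid."""
    first_x = {}
    for sid, eid in x_list:
        first_x[sid] = min(eid, first_x.get(sid, eid))
    return [(sid, eid) for sid, eid in y_list
            if sid in first_x and first_x[sid] < eid]


def event_join(x_list, y_list):
    """Id-list of patterns ending in the same event: matching (sid, eid) pairs."""
    return sorted(set(x_list) & set(y_list))


def support(id_list):
    """Number of distinct sequences containing the pattern."""
    return len({sid for sid, _ in id_list})


# Assumed toy id-lists for single items (not from the slides).
A = [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)]
B = [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)]
D = [(1, 10), (1, 25), (4, 10)]
F = [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)]

d_b = temporal_join(D, B)       # D->B
d_f = temporal_join(D, F)       # D->F
d_bf = event_join(d_b, d_f)     # D->BF
d_bf_a = temporal_join(d_bf, A) # D->BF->A
print(d_bf_a, support(d_bf_a))
```

Because every id-list records the eid of the pattern's last event, each join needs only the two input lists, and a candidate R is kept when `support(R)` reaches min_sup.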
21. The R Project for Statistical Computing
R is a different implementation of the S language
S was developed at Bell Laboratories (formerly
AT&T, now Lucent Technologies) by John
Chambers and colleagues
arulesSequences package