1. Massively Distributed Environments and Closed
Itemset Mining: the DCIM Approach
Mehdi Zitouni & Reza Akbarinia & Sadok Ben Yahia & Florent Masseglia
Mehdi.Zitouni@inria.fr
CAiSE 2017, June 16th 2017, Essen, Germany
2. Plan
1 • Knowledge discovery in big data
2 • DCIM approach for CFI mining in big data
3 • Experimental results
4 • Conclusion
3. Big data mining
• Advances in hardware and software technologies: Internet, social networks, smartphones, etc.
• Big data mining: extracting multiple forms of knowledge
• Pattern recognition, statistics, databases, linguistics, and visualization
6. Big data mining
• A class of useful patterns: Frequent Itemsets
• Frequency of elements in a database: behavior of employees in companies, behavior of customers in stores, etc.
• When data volumes grow, the number of frequent itemsets explodes!
• A condensed representation of frequent patterns that yields the same results: Closed Frequent Itemsets
7. Preliminary Notions: CFI
• Itemset support: the number of transactions containing the itemset
• Frequent itemset: its support is ≥ a threshold σ specified by the user
• Closed frequent itemset: a frequent itemset with no superset of the same support count; closed itemsets form a condensed representation of the frequent itemsets
Example: with σ = 2
• A, B, C, E: items
• ABC, BCE, …: itemsets
T id   Itemset
1      A C
2      B C E
3      A B C E
4      B E
5      A B C E

(Figure: part of the itemset lattice over this dataset with supports; e.g., ABC has support 2 and is not closed because its superset ABCE also has support 2, whereas BCE is closed with support 3.)
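To make these notions concrete, here is a minimal Java sketch (illustrative only, not part of DCIM; the class and helper names are my own) that computes supports and checks closedness over the toy dataset above:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClosedItemsetCheck {
    // Toy dataset from the example above (sigma = 2).
    static final List<Set<Character>> DB = List.of(
        Set.of('A', 'C'), Set.of('B', 'C', 'E'), Set.of('A', 'B', 'C', 'E'),
        Set.of('B', 'E'), Set.of('A', 'B', 'C', 'E'));

    // Support: the number of transactions containing the itemset.
    static int support(Set<Character> itemset) {
        int s = 0;
        for (Set<Character> t : DB) if (t.containsAll(itemset)) s++;
        return s;
    }

    // Closed: no strict superset has the same support count.
    static boolean isClosed(Set<Character> itemset) {
        int s = support(itemset);
        for (char item : new char[]{'A', 'B', 'C', 'E'}) {
            if (itemset.contains(item)) continue;
            Set<Character> extended = new HashSet<>(itemset);
            extended.add(item);
            if (support(extended) == s) return false;  // same support: not closed
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(support(Set.of('A', 'B', 'C')));   // 2: frequent for sigma = 2
        System.out.println(isClosed(Set.of('A', 'B', 'C')));  // false: ABCE also has support 2
        System.out.println(isClosed(Set.of('B', 'C', 'E')));  // true: support 3, no equal-support superset
    }
}
```

Testing one-item extensions suffices: if any strict superset had the same support, some single-item extension would too, since support can only shrink as items are added.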
8. Preliminary Notions: MapReduce
• Distributed data processing platform introduced by Google [1],
• Available as the open-source Apache Hadoop.
• Programming model based on key-value pairs: map and reduce functions!
[1] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
Example: Word Count
Input split into three chunks: (A, A, B), (B, B, C), (C, B, A)
Map phase: each chunk emits one (word, 1) pair per occurrence, e.g. (A,1) (A,1) (B,1)
Shuffle: pairs are grouped by key: A → 1,1,1 ; B → 1,1,1,1 ; C → 1,1
Reduce phase: each group is summed: (A, 3), (B, 4), (C, 2)
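For reference, this word-count example corresponds to the canonical Hadoop program. The sketch below uses the standard org.apache.hadoop.mapreduce API; it is the textbook example, not DCIM's code:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every token of the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) { word.set(it.nextToken()); ctx.write(word, ONE); }
    }
  }

  // Reduce: sum the 1s grouped under each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation before shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```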
9. DCIM algorithm
• Three steps:
1. Splitting: splits the dataset into multiple successive partitions
2. Job 1, Frequency counting: a first pass over the dataset counts the support of each item and prunes non-frequent ones
3. Job 2, CFI Mining: mines the CFIs using a prime-number-based approach
• Prime-number-based approach: a data encoding that avoids string operations, which are very costly in terms of communication and execution time.
Example: membership test
Assign a distinct prime to each item: X → 2, Y → 3, Z → 5
An itemset is encoded as the product of its primes: XY → 2 × 3 = 6
Is X ⊂ XY? Since 6 mod 2 == 0, yes: X ⊂ XY
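A minimal sketch of this membership test (illustrative; real transactions can be long, so an actual implementation would need big-integer arithmetic such as java.math.BigInteger instead of long):

```java
import java.util.Map;

public class PrimeEncoding {
    // Illustrative prime assignment: each item maps to a distinct prime.
    static final Map<Character, Long> PRIME = Map.of('X', 2L, 'Y', 3L, 'Z', 5L);

    // An itemset is encoded as the product of its items' primes.
    static long encode(String itemset) {
        long product = 1;
        for (char item : itemset.toCharArray()) product *= PRIME.get(item);
        return product;
    }

    // By unique factorization, A is a subset of B iff encode(A) divides encode(B).
    static boolean isSubset(long a, long b) {
        return b % a == 0;
    }

    public static void main(String[] args) {
        long x  = encode("X");   // 2
        long xy = encode("XY");  // 6
        System.out.println(isSubset(x, xy));            // true: 6 % 2 == 0
        System.out.println(isSubset(encode("Z"), xy));  // false: 6 % 5 != 0
    }
}
```

Correctness rests on unique factorization: encode(A) divides encode(B) exactly when every prime of A, and hence every item of A, also occurs in B.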
10. DCIM algorithm: Frequency counting
Example: with σ = 2

T id   Itemset
1      A D C
2      B C E
3      A B C E
4      B E
5      A B C E

First pass: count the support of each item.
Item   Support
A      3
B      4
C      4
D      1
E      4

D is pruned (support 1 < σ). The remaining items are assigned primes in descending order of support:
Item   Support   Prime
B      4         2
C      4         3
E      4         5
A      3         7

Each transaction is then rewritten in that order and encoded as the product of its items' primes:
Itemset    Primes        Product
C A        3, 7          21
B C E      2, 3, 5       30
B C E A    2, 3, 5, 7    210
B E        2, 5          10
B C E A    2, 3, 5, 7    210
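A sequential sketch of this first job's logic (illustrative plain Java with an in-memory dataset; in DCIM this runs as a MapReduce job, and the tie-breaking among items of equal support may differ from the slide):

```java
import java.util.*;
import java.util.stream.Collectors;

public class FrequencyCounting {
    public static void main(String[] args) {
        int sigma = 2;  // minimum support threshold
        List<List<String>> db = List.of(
            List.of("A", "D", "C"), List.of("B", "C", "E"), List.of("A", "B", "C", "E"),
            List.of("B", "E"), List.of("A", "B", "C", "E"));

        // Job 1: count the support of each item.
        Map<String, Integer> support = new HashMap<>();
        for (List<String> t : db)
            for (String item : t) support.merge(item, 1, Integer::sum);

        // Prune infrequent items, sort the rest by descending support, and
        // assign primes in that order; ties (here B, C, E) may come out in
        // a different order than on the slide.
        long[] primes = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};
        List<String> frequent = support.entrySet().stream()
            .filter(e -> e.getValue() >= sigma)
            .sorted((a, b) -> b.getValue() - a.getValue())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
        Map<String, Long> prime = new HashMap<>();
        for (int i = 0; i < frequent.size(); i++) prime.put(frequent.get(i), primes[i]);

        // Encode each transaction as the product of its frequent items' primes.
        for (List<String> t : db) {
            long product = 1;
            for (String item : t)
                if (prime.containsKey(item)) product *= prime.get(item);
            System.out.println(t + " -> " + product);  // e.g. [A, D, C] -> 21
        }
    }
}
```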
11. DCIM algorithm: CFI Mining “Map Phase”
• The map phase works on minimized sub-datasets, called conditional-contexts.
• What is a conditional-context?
Example: with σ = 2

Itemset    Primes        Product
C A        3, 7          21
B C E      2, 3, 5       30
B C E A    2, 3, 5, 7    210
B E        2, 5          10
B C E A    2, 3, 5, 7    210

A-conditional-context (transactions containing A, with A removed):
Itemset    Primes     Product
C          3          3
B C E      2, 3, 5    30
B C E      2, 3, 5    30

AB-conditional-context (from the A-context: keep the entries containing B, then remove B):
Itemset    Primes     Product
C E        3, 5       15
C E        3, 5       15
12. DCIM algorithm: CFI Mining “Map Phase”
The map function decomposes each transaction encoding by its prime factors: the largest prime factor (the least frequent item) becomes the output key, the remaining product is emitted as a conditional transaction, and the decomposition recurses on the remainder.

Map Input (Tid)     Processing               Map Outputs (key : value)
{C A} = 21          21 = 3 × 7               {A} = 7 : {C} = 3
{B C E} = 30        30 = 2 × 3 × 5           {E} = 5 : {B C} = 6
                    6 = 2 × 3                {C} = 3 : {B} = 2
{B C E A} = 210     210 = 2 × 3 × 5 × 7      {A} = 7 : {B C E} = 30
                    30 = 2 × 3 × 5           {E} = 5 : {B C} = 6
                    6 = 2 × 3                {C} = 3 : {B} = 2
{B E} = 10          10 = 2 × 5               {E} = 5 : {B} = 2
{B C E A} = 210     210 = 2 × 3 × 5 × 7      {A} = 7 : {B C E} = 30
                    30 = 2 × 3 × 5           {E} = 5 : {B C} = 6
                    6 = 2 × 3                {C} = 3 : {B} = 2
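A sequential sketch of this map function (illustrative; in DCIM it runs inside Hadoop mappers, the prime table comes from Job 1, and long transactions would again call for big-integer arithmetic):

```java
import java.util.ArrayList;
import java.util.List;

public class ConditionalContextMap {
    // Primes in ascending order = items in descending support (B=2, C=3, E=5, A=7).
    static final long[] ITEM_PRIMES = {2, 3, 5, 7};

    // For one encoded transaction, emit (prime, conditional transaction) pairs:
    // repeatedly strip the largest prime factor and emit the remaining product.
    static List<long[]> map(long transaction) {
        List<long[]> out = new ArrayList<>();
        long rest = transaction;
        for (int i = ITEM_PRIMES.length - 1; i >= 0; i--) {
            long p = ITEM_PRIMES[i];
            if (rest % p == 0) {
                rest /= p;
                if (rest > 1) out.add(new long[]{p, rest});  // key : value
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (long t : new long[]{21, 30, 210, 10, 210})
            for (long[] kv : map(t))
                System.out.println(t + " -> key " + kv[0] + " : value " + kv[1]);
        // e.g. 210 -> key 7 : value 30, key 5 : value 6, key 3 : value 2
    }
}
```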
13. DCIM algorithm: CFI Mining “Reduce Phase”
• An itemset is closed if no superset has the same support count; DCIM checks this with GCD calculations over each conditional-context.
Example:
A-conditional-context, key {7}: values 3, 30, 30
GCD(3, 30, 30) = 3 → Output: {3 × 7 = 21} → A C (support 3)

E-conditional-context, key {5}: values 6, 6, 2, 6
GCD(6, 6, 2, 6) = 2 → Output: {5 × 2 = 10} → B E (support 4)
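A minimal sketch of this GCD-based reduce step (illustrative; it covers only the top-level closure computation shown on the slide, not everything a full DCIM reducer does):

```java
import java.util.List;

public class ConditionalContextReduce {
    static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }

    // Reduce: the closure of the key item is key × GCD of all its conditional
    // transactions; its support is the number of values received.
    static void reduce(long keyPrime, List<Long> values, int sigma) {
        if (values.size() < sigma) return;  // not frequent: discard
        long g = 0;
        for (long v : values) g = gcd(g, v);
        System.out.println("closed itemset encoding: " + keyPrime * g
            + " (support " + values.size() + ")");
    }

    public static void main(String[] args) {
        reduce(7, List.of(3L, 30L, 30L), 2);    // 7 × 3 = 21 -> A C, support 3
        reduce(5, List.of(6L, 6L, 2L, 6L), 2);  // 5 × 2 = 10 -> B E, support 4
    }
}
```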
15. Experimental Results: Datasets
• Wikipedia Articles
• Each line represents a research article,
• 7,892,123 transactions with 6,853,616 items,
• Maximal transaction length: 153,953.
• ClueWeb
• One billion web pages in ten languages,
• 53,268,952 transactions with 11,153,752 items,
• Maximal transaction length: 689,153.
16. Experimental Results: Setup and implementation
• One cluster of the Grid'5000 testbed:
• 32 nodes running Hadoop 2.6.0,
• 96 GB of RAM,
• 2.9 to 3.9 GHz processors,
• Java (OpenJDK 7).
• Compared against a straightforward MapReduce implementation of the CLOSET algorithm and parallel FP-Growth.
• Measured execution time and speedup for multiple values of σ.
20. Conclusion
• A reliable and efficient parallel algorithm for CFI mining, namely DCIM,
• DCIM shows significantly better performance than state-of-the-art approaches,
• An efficient data encoding: prime-number processing!
→ The approach is effective and efficient
• Future work: CFI mining in data streams