Massively Distributed Environments and Closed
Itemset Mining: the DCIM Approach
Mehdi Zitouni & Reza Akbarinia & Sadok Ben Yahia & Florent Masseglia
Mehdi.Zitouni@inria.fr
CAiSE 2017, June 16, 2017, Essen, Germany
Plan
1 • Knowledge discovery in big data
2 • DCIM approach for CFI mining in big data
3 • Experimental results
4 • Conclusion
Big data mining
• Advances in hardware and software technologies : Internet, social
networks, smart phones, etc.
• Big data mining : multiple forms of knowledge
• Pattern recognition, statistics, databases, linguistics and visualization
Knowledge discovery
Big data mining
• A class of useful patterns : Frequent Itemsets.
• Frequency of elements in a database : behavior of the employees in
companies, behavior of the customers in stores, etc.
• When the data volume grows, the number of frequent itemsets grows too !
• Closed Frequent Itemsets : a condensed representation of the frequent
patterns that gives the same results
Preliminary Notions : CFI
• Itemset support : the number of transactions containing the itemset
• Frequent itemset : its support is ≥ a threshold σ specified by the user
• Closed frequent itemset (CFI) : an itemset that is frequent and closed
(no superset has the same support count) ; the CFIs form a condensed
representation of the frequent itemsets
example : having σ = 2
• A, B, C, E : items
• ABC, BCE, … : itemsets
T id Itemset
1 A C
2 B C E
3 A B C E
4 B E
5 A B C E
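The definitions above can be checked by brute force on the five example transactions. The sketch below is only an illustration of support, frequency and closedness with σ = 2; it is not the DCIM algorithm, and the names `TRANSACTIONS`, `support` and `closed_frequent_itemsets` are hypothetical.

```python
from itertools import combinations

# Example database from the slide (T id -> items).
TRANSACTIONS = {
    1: {"A", "C"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
    5: {"A", "B", "C", "E"},
}


def support(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for items in transactions.values() if wanted <= items)


def closed_frequent_itemsets(transactions=TRANSACTIONS, sigma=2):
    """Brute-force enumeration: fine for 4 items, exponential in general."""
    items = sorted(set().union(*transactions.values()))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = support(combo, transactions)
            if count >= sigma:
                frequent[frozenset(combo)] = count
    closed = {}
    for itemset, count in frequent.items():
        # Closed: no proper superset has the same support count.
        has_equal_superset = any(
            itemset < other and count == frequent[other] for other in frequent
        )
        if not has_equal_superset:
            closed["".join(sorted(itemset))] = count
    return closed
```

On this database the function returns C (support 4), AC (3), BE (4), BCE (3) and ABCE (2). Itemsets such as A or BC are frequent but not closed, because AC and BCE have the same support counts.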
[Figure : the itemsets ABC, ABE, ACE, BCE and ABCE annotated with their support counts in the table above]
Preliminary Notions : MapReduce
• Distributed data processing platform by Google [1],
• Available as open-source Apache Hadoop.
• Programming model based on key-value pairs : map and reduce functions !
[1] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
example : Word Count
Input splits : (A, A, B) (B, B, C) (C, B, A)
Map phase : (A,1) (A,1) (B,1) | (B,1) (B,1) (C,1) | (C,1) (B,1) (A,1)
Grouped by key : A → 1, 1, 1 ; B → 1, 1, 1, 1 ; C → 1, 1
Reduce phase : (A, 3) (B, 4) (C, 2)
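The word-count example can be simulated in a single process to make the key-value flow explicit. This is a minimal sketch, not Hadoop code: `map_phase`, `group_by_key` and `reduce_phase` are hypothetical names, and the three input splits are the ones shown on the slide.

```python
from collections import defaultdict

# The three input splits of the word-count example.
SPLITS = [["A", "A", "B"], ["B", "B", "C"], ["C", "B", "A"]]


def map_phase(split):
    """Emit one (word, 1) pair per word of the split."""
    return [(word, 1) for word in split]


def group_by_key(pairs):
    """Collect all values emitted for the same key (the shuffle step)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the partial counts of one word."""
    return key, sum(values)


def word_count(splits):
    pairs = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(key, values) for key, values in group_by_key(pairs).items())
```

Counting the nine input tokens this way gives A = 3, B = 4 and C = 2. In a real cluster each split would be mapped on a different node and the grouping would be done by the framework.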
DCIM algorithm
• Three steps :
1. Splitting : splits the dataset into multiple successive parts
2. Job 1 : Frequency counting : a first pass over the dataset counts the
support of each item and prunes the non-frequent ones
3. Job 2 : CFI Mining : mines the CFIs using a prime-number-based approach
• Prime-number-based approach : a data modeling that avoids string operations,
which are very costly in terms of communication and execution time.
example : membership test
X ; 2
Y ; 3
Z ; 5
X Y ; 2 × 3 = 6
Is X ⊂ X Y ?
If (6 % 2) == 0
Then X ⊂ X Y : True
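The membership test above can be written in a few lines. This is a sketch under the slide's assumptions (one distinct prime per item, an itemset encoded as the product of its primes); `PRIMES`, `encode` and `is_subset` are hypothetical names.

```python
# Item -> prime assignment taken from the example.
PRIMES = {"X": 2, "Y": 3, "Z": 5}


def encode(itemset, primes=PRIMES):
    """Encode an itemset as the product of the primes of its distinct items."""
    product = 1
    for item in set(itemset):
        product *= primes[item]
    return product


def is_subset(candidate, itemset, primes=PRIMES):
    """candidate ⊆ itemset  <=>  encode(itemset) is divisible by encode(candidate)."""
    return encode(itemset, primes) % encode(candidate, primes) == 0
```

For example, `is_subset("X", "XY")` evaluates `6 % 2 == 0` and is true, while `is_subset("Z", "XY")` evaluates `6 % 5 == 0` and is false. Python integers have arbitrary precision; with fixed-width integers the products of long transactions would overflow, so a real implementation needs a big-integer type or another safeguard.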
DCIM algorithm : Frequency counting
Example : having σ = 2
T id Itemset
1 A D C
2 B C E
3 A B C E
4 B E
5 A B C E
items Support
A 3
B 4
C 4
D 1
E 4
Frequent items in descending order of supports, with their primes :
items   Support   Primes
B       4         2
C       4         3
E       4         5
A       3         7

Encoded transactions (T id = multiplication of the prime numbers) :
Itemset     Prime numbers   T id
C A         3, 7            21
B C E       2, 3, 5         30
B C E A     2, 3, 5, 7      210
B E         2, 5            10
B C E A     2, 3, 5, 7      210
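The frequency-counting job and the prime encoding can be sketched as follows. This is a single-machine illustration, not the authors' MapReduce job; the function names are hypothetical, and breaking support ties alphabetically is an assumption chosen because it reproduces the order B, C, E, A of the example.

```python
from collections import Counter

# Example database of the slide (item D occurs only once).
TRANSACTIONS = [
    ["A", "D", "C"],
    ["B", "C", "E"],
    ["A", "B", "C", "E"],
    ["B", "E"],
    ["A", "B", "C", "E"],
]


def first_primes(n):
    """Return the first n prime numbers by trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def assign_primes(transactions, sigma):
    """Count item supports, prune items below sigma, give small primes to frequent items."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [item for item, count in counts.items() if count >= sigma]
    # Descending support; ties broken alphabetically (assumption).
    frequent.sort(key=lambda item: (-counts[item], item))
    return dict(zip(frequent, first_primes(len(frequent))))


def encode_transactions(transactions, primes):
    """Replace each transaction by the product of the primes of its frequent items."""
    encoded = []
    for t in transactions:
        product = 1
        for item in set(t):
            if item in primes:  # pruned items (here D) are skipped
                product *= primes[item]
        encoded.append(product)
    return encoded
```

With σ = 2, `assign_primes` maps B, C, E, A to 2, 3, 5, 7 and `encode_transactions` yields 21, 30, 210, 10, 210, matching the table.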
DCIM algorithm : CFI Mining “Map Phase”
• Sets of minimized contexts, denoted as Conditional-context.
• Conditional-context ?
Example : having σ = 2
Encoded transactions :
Itemset     Prime numbers   T id
C A         3, 7            21
B C E       2, 3, 5         30
B C E A     2, 3, 5, 7      210
B E         2, 5            10
B C E A     2, 3, 5, 7      210

A-Conditional-context :
Itemset     Prime numbers   T id
C           3               3
B C E       2, 3, 5         30
B C E       2, 3, 5         30

AB-Conditional-context (rows of the A-Conditional-context that contain B, with «B» removed) :
Itemset     Prime numbers   T id
C E         3, 5            15
C E         3, 5            15
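With the prime encoding, building a conditional context reduces to a divisibility test and an integer division. The sketch below reproduces the numbers of the example; `conditional_context` is a hypothetical helper, not a function from the authors' implementation.

```python
# Encoded transactions of the example (B=2, C=3, E=5, A=7).
TIDS = [21, 30, 210, 10, 210]


def conditional_context(context, prime):
    """Keep the entries that contain the item (divisible by its prime) and remove it."""
    return [tid // prime for tid in context if tid % prime == 0]


# A-conditional context: transactions containing A, with A divided out.
A_CONTEXT = conditional_context(TIDS, 7)

# AB-conditional context: entries of the A-context containing B, with B divided out.
AB_CONTEXT = conditional_context(A_CONTEXT, 2)
```

Here `A_CONTEXT` is `[3, 30, 30]` (C, BCE, BCE) and `AB_CONTEXT` is `[15, 15]` (CE, CE).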
DCIM algorithm : CFI Mining “Map Phase”
Map input (T id) : {C A} = 21
    Processing : 21 = 3 × 7
    Map outputs : {A} = 7 : {C} = 3
Map input (T id) : {B C E} = 30
    Processing : 30 = 2 × 3 × 5 ; 6 = 2 × 3
    Map outputs : {E} = 5 : {BC} = 6 ; {C} = 3 : {B} = 2
Map input (T id) : {B C E A} = 210
    Processing : 210 = 2 × 3 × 5 × 7 ; 30 = 2 × 3 × 5 ; 6 = 2 × 3
    Map outputs : {A} = 7 : {BCE} = 30 ; {E} = 5 : {BC} = 6 ; {C} = 3 : {B} = 2
Map input (T id) : {B E} = 10
    Processing : 10 = 2 × 5
    Map outputs : {E} = 5 : {B} = 2
Map input (T id) : {B C E A} = 210
    Processing : 210 = 2 × 3 × 5 × 7 ; 30 = 2 × 3 × 5 ; 6 = 2 × 3
    Map outputs : {A} = 7 : {BCE} = 30 ; {E} = 5 : {BC} = 6 ; {C} = 3 : {B} = 2
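The map outputs listed above follow a simple pattern: for each item of a transaction, taken from the last to the second in support order, the key is the item's prime and the value is the product of the items that precede it. A sketch of such a map function, with the hypothetical names `factorize` and `map_transaction`, could look like this; it mirrors the example rather than the authors' code.

```python
# Primes of the frequent items in descending support order: B, C, E, A.
PRIMES = [2, 3, 5, 7]


def factorize(tid, primes=PRIMES):
    """Primes of the items present in an encoded transaction, in support order."""
    return [p for p in primes if tid % p == 0]


def map_transaction(tid, primes=PRIMES):
    """Emit (item prime, product of the preceding items) for every item with a non-empty prefix."""
    pairs = []
    prefix = tid
    for p in reversed(factorize(tid, primes)):
        prefix //= p  # strip the current item, leaving its prefix
        if prefix > 1:  # the first item has an empty prefix: nothing to emit
            pairs.append((p, prefix))
    return pairs
```

For the transaction 210 this emits (7, 30), (5, 6) and (3, 2), i.e. {A} : {BCE}, {E} : {BC} and {C} : {B}.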
DCIM algorithm : CFI Mining “Reduce Phase”
Closure check : an itemset is closed if no superset of it has the same
support count ; this is tested with GCD calculations.
Example :
A-Conditional-context : {7}
    Values : 3, 30, 30
    GCD = 3
    Output : { 3 × 7 = 21 } → A C
E-Conditional-context : {5}
    Values : 6, 6, 2, 6
    GCD = 2
    Output : { 5 × 2 = 10 } → B E
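The GCD step can be sketched directly: the greatest common divisor of a conditional context encodes the items shared by all of its entries, and multiplying it by the key gives the closed itemset. The helper name `closure` is hypothetical, and the context is assumed to be non-empty.

```python
from functools import reduce
from math import gcd


def closure(key, context):
    """Multiply the key by the GCD of its (non-empty) conditional context."""
    return key * reduce(gcd, context)


AC = closure(7, [3, 30, 30])    # GCD = 3 (C), so 3 × 7 = 21
BE = closure(5, [6, 6, 2, 6])   # GCD = 2 (B), so 2 × 5 = 10
```

A GCD of 1 means the entries of the context share no item, so the key itself is returned unchanged.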
DCIM algorithm : CFI Mining “Reduce Phase”
Map Outputs
{A} = 7 : {C} = 3
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
{A} = 7 : {BCE} = 30
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
{E} = 5 : {B} = 2
{A} = 7 : {BCE} = 30
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
Reduce Inputs, followed by CFI Mining → Reduce Outputs :
{A} = 7 : {3, 30, 30}
    GCD(3, 30, 30) = 3 = C → 3 × 7 = 21 : {AC} is CFI
{AB} = 14 : {15, 15}
    GCD(15, 15) = 15 = CE → 15 × 14 = 210 : {ABCE} is CFI
{AE} ? → {AE} ⊂ {ABCE} : STOP !
{E} = 5 : {6, 6, 2, 6}
    GCD(6, 6, 2, 6) = 2 = B → 2 × 5 = 10 : {BE} is CFI
{EC} = 15 : {2, 2, 2}
    GCD(2, 2, 2) = 2 = B → 2 × 15 = 30 : {BCE} is CFI
{C} = 3 : {7, 2, 2, 2}
    GCD(7, 2, 2, 2) = 1 → 3 = C : {C} is CFI
CFIs = {C, AC, ABCE, BE, BCE}
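The reduce outputs are products of primes, so reading them back as itemsets only requires testing divisibility by each item's prime. A small decoding sketch, assuming the item-to-prime table of the example (`ITEM_PRIMES` and `decode` are hypothetical names):

```python
# Item -> prime assignment of the running example.
ITEM_PRIMES = {"B": 2, "C": 3, "E": 5, "A": 7}


def decode(product, item_primes=ITEM_PRIMES):
    """Return the items whose prime divides the product, as a sorted string."""
    return "".join(sorted(item for item, p in item_primes.items() if product % p == 0))


# Products computed in the reduce example above.
DECODED = [decode(value) for value in (21, 210, 10, 30, 3)]
```

`DECODED` is `["AC", "ABCE", "BE", "BCE", "C"]`, the itemsets reported as closed in the example.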
Experimental Results : Datasets
• Wikipedia Articles
• Each line mimics a research article,
• 7,892,123 transactions with 6,853,616 items,
• Maximal length of a transaction is 153,953.
• ClueWeb
• One billion web pages in ten languages,
• 53,268,952 transactions with 11,153,752 items,
• Maximal length of a transaction is 689,153.
Experimental Results : Setup and implementation
• One of the clusters of Grid5000
• 32 nodes running Hadoop 2.6.0,
• 96 GB of RAM,
• 2.9 to 3.9 GHz processors,
• Java with OpenJDK 7 (openjdk-7-jdk).
• Compared to a basic MapReduce implementation of the CLOSET algorithm
and to the parallel FP-growth.
• Measured : execution time and speedup for multiple values of σ.
Efficiency : Wikipedia Articles
Speedup : ClueWeb
Conclusion
• Big data : a game-changing revolution !
• DCIM : a reliable and efficient parallel algorithm for CFI mining,
• DCIM shows significantly better performance than state-of-the-art
approaches,
• An efficient data modeling : prime-number processing !
→ The approach is effective and efficient
• Future work : CFI mining in data streams
Thank you !
Questions ?
