Theory Meets Reality—Large Scale Frequent Pattern Mining with Apache Spark in the Real World with Kexin Xie and Wanderley Liu


Salesforce Einstein is the artificial intelligence layer that delivers predictions and recommendations based on the customer’s unique business processes and data. Einstein Journey Insight is one of the key products offered by Salesforce DMP to help marketers and publishers leverage AI to analyze billions of touchpoints across consumer journeys and discover the optimal paths to conversion, including insights about which channels, messages, and events perform best.

To understand how consumers engage with website articles, advertising campaigns, social events, and products, and how that engagement ultimately leads to a conversion, analysts need to identify key events among thousands of events per user. Frequent pattern mining is the key technique for solving such problems. We have all heard the beer and diaper story about mining consumer buying habits. At Salesforce DMP, however, we see over 3.5 billion unique users globally a month across site, media, mobile app, transactional, and offline traffic sources. That is more than Facebook, Wikipedia and Twitter combined. The sheer volume and the heterogeneous nature of events and their metadata offer unique opportunities to analyze the complete consumer journey.

This scale also makes it extra challenging to interpret the results, or even to run the frequent pattern algorithm cost-effectively. In this talk, we share our experience of running large-scale frequent pattern mining operations using Apache Spark in our Einstein Journey Insight product. We examine the practicality of the frequent pattern technique and show how Spark helps us address the scaling problem, deal with diverse metadata, and generate interpretable and actionable insights.

Published in: Data & Analytics


  1. Chart of number of combinations against number of items in set. 8 items: 256 combinations
  2. Same chart. 8 items: 256 combinations; 20 items: 1,048,576 combinations
  3. Same chart. 8 items: 256 combinations; 20 items: 1,048,576 combinations; 140,000 items: ???
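
The values on these three slides match the number of subsets of an n-item set, 2^n (2^8 = 256, 2^20 = 1,048,576). A minimal sketch of that arithmetic follows; `Combinations`, `subsetCount`, and `decimalDigits` are illustrative names, not taken from the talk.

```scala
object Combinations {
  // Number of subsets of an n-item set (empty set included): 2^n.
  def subsetCount(n: Int): BigInt = BigInt(2).pow(n)

  // Number of decimal digits needed to write 2^n.
  def decimalDigits(n: Int): Int = subsetCount(n).toString.length

  def main(args: Array[String]): Unit = {
    println(subsetCount(8))        // the 8-item case from slide 1
    println(subsetCount(20))       // the 20-item case from slide 2
    println(decimalDigits(140000)) // size of the "???" on slide 3, in digits
  }
}
```

For 140,000 items the count no longer fits in any fixed-width integer, which is why the sketch uses `BigInt` and reports a digit count instead of the value itself.
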
  4. Theory Meets Reality: Large Scale Frequent Pattern Mining with Apache Spark in the Real World. Kexin Xie, Architect of Marketing Cloud Einstein (kexin.xie@salesforce.com, @realstraw); Wanderley Liu, Senior Data Science Engineer (wanderley.liu@salesforce.com)
  5. Marketing Cloud Einstein Journey Insights (GA)
     • Track the entire consumer journey: gather online and offline interactions to stitch together a complete view of the consumer
     • Discover the optimal path to conversion: use AI to analyze all journey permutations and automatically recommend the best channels, offers and sequences that lead to conversion
     • Learn how customers are actually interacting with your brand
  6. What is Frequent Pattern Mining? (Image: Mine Shaft Mural Painting by Frank Wilson)
  7. Items: a, b, c, d, e
  8. User table:
     User  Items
     u-1   a, b
     u-2   b, c, d
     u-3   a, c, d, e
     u-4   a, d, e
     u-5   a, b, c
     u-6   a, b, c, d
     u-7   a
     u-8   a, b, c
     u-9   a, b, d
     u-10  b, c, e
  9. User table from slide 8, plus item support:
     item  support
     a     8
     b     7
     c     6
     d     5
     e     3
  10. Slide 9, plus support for item pairs:
     item  support
     a, b  5
     a, c  4
     a, d  4
     a, e  2
     ...   ...
  11. Slide 10, plus Min Support = 4
  12. Slide 11, with both support tables shown twice
  13. Slide 12, with the labels L1 Patterns and L2 Patterns
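
The support tables on these slides can be reproduced by counting, for each pattern, the rows that contain all of its items. The sketch below does that over the ten user rows. The object and function names are illustrative, and treating "Min Support = 4" as an inclusive threshold (support >= 4) is an assumption.

```scala
object SupportCounting {
  // The ten user rows from the slides.
  val rows: Seq[Set[String]] = Seq(
    Set("a", "b"), Set("b", "c", "d"), Set("a", "c", "d", "e"), Set("a", "d", "e"),
    Set("a", "b", "c"), Set("a", "b", "c", "d"), Set("a"), Set("a", "b", "c"),
    Set("a", "b", "d"), Set("b", "c", "e")
  )

  // Support of a pattern: number of rows that contain every item of the pattern.
  def support(pattern: Set[String]): Int = rows.count(row => pattern.subsetOf(row))

  // Every pattern of size k over the items seen in the data, with its support.
  def patternsOfSize(k: Int): Map[Set[String], Int] = {
    val items: Set[String] = rows.flatten.toSet
    items.subsets(k).map(p => p -> support(p)).toMap
  }

  // Keep only the patterns that reach the minimum support (assumed inclusive).
  def frequent(k: Int, minSupport: Int): Map[Set[String], Int] =
    patternsOfSize(k).filter { case (_, s) => s >= minSupport }

  def main(args: Array[String]): Unit = {
    println(frequent(1, 4)) // single items with support >= 4
    println(frequent(2, 4)) // item pairs with support >= 4
  }
}
```

This brute-force count enumerates every pattern of a given size, which is exactly what stops working as the number of items grows.
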
  14. A-priori Principle: “All sub-patterns of a frequent pattern are frequent” (Image: A Priori in Berkeley, CA)
  15. Min Support = 4. Item support: a 8, b 7, c 6, d 5, e 3. Pair support not yet computed: a, b ?; a, c ?; a, d ?; a, e ?; ...
  16. Slide 15 shown side by side for Min Support = 4 and Min Support = 6, with the same item support table and the same uncomputed pair supports
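
Read in reverse, the a-priori principle says that a pattern with an infrequent sub-pattern cannot be frequent, so its support never needs to be counted. The sketch below shows how that shrinks the candidate set for the item supports on these slides. The names are illustrative and the inclusive threshold is an assumption.

```scala
object AprioriPruning {
  // Item supports from the slides.
  val itemSupport: Map[String, Int] = Map("a" -> 8, "b" -> 7, "c" -> 6, "d" -> 5, "e" -> 3)

  // Pairs worth counting: both items must be frequent on their own.
  def candidatePairs(minSupport: Int): Set[Set[String]] = {
    val frequentItems: Set[String] =
      itemSupport.collect { case (item, s) if s >= minSupport => item }.toSet
    frequentItems.subsets(2).toSet
  }

  // General step: a (k+1)-item candidate survives only if all of its k-item subsets are frequent.
  def nextCandidates(frequentK: Set[Set[String]]): Set[Set[String]] = {
    val k = frequentK.headOption.map(_.size).getOrElse(0)
    val items: Set[String] = frequentK.flatten
    items.subsets(k + 1).filter(c => c.subsets(k).forall(frequentK.contains)).toSet
  }

  def main(args: Array[String]): Unit = {
    println(candidatePairs(4).size) // pairs to count when Min Support = 4
    println(candidatePairs(6).size) // pairs to count when Min Support = 6
  }
}
```

Raising the minimum support from 4 to 6 removes d from the frequent items, so every pair involving d drops out before any row is scanned.
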
  17. FP-Growth
  18-23. (Same figure on each slide.) Header Table: a 8, b 7, c 6, d 5, e 3. FP-tree:
      root
        a: 8
          b: 5
            c: 3
              d: 1
            d: 1
          c: 1
            d: 1
              e: 1
          d: 1
            e: 1
        b: 2
          c: 2
            d: 1
            e: 1
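
An FP-tree like the one on these slides is typically built by sorting each row in header-table order and inserting it as a path from the root, incrementing a counter on every node it passes. The following is a minimal sketch under that assumption; `Node`, `build`, and `render` are illustrative names, and breaking support ties by item name is an arbitrary choice.

```scala
import scala.collection.mutable

object FpTreeSketch {
  // A tree node: an item, the number of rows passing through it, and its children by item.
  final class Node(val item: String) {
    var count: Int = 0
    val children: mutable.LinkedHashMap[String, Node] = mutable.LinkedHashMap.empty
  }

  // The ten user rows from the slides.
  val rows: Seq[Seq[String]] = Seq(
    Seq("a", "b"), Seq("b", "c", "d"), Seq("a", "c", "d", "e"), Seq("a", "d", "e"),
    Seq("a", "b", "c"), Seq("a", "b", "c", "d"), Seq("a"), Seq("a", "b", "c"),
    Seq("a", "b", "d"), Seq("b", "c", "e")
  )

  // Header table order: items by descending support, ties broken by name.
  def headerOrder(rows: Seq[Seq[String]]): Seq[String] =
    rows.flatten.groupBy(identity).map { case (i, xs) => i -> xs.size }.toSeq
      .sortBy { case (i, n) => (-n, i) }.map(_._1)

  // Insert every row, sorted in header order, as a path starting at the root.
  def build(rows: Seq[Seq[String]]): Node = {
    val rank: Map[String, Int] = headerOrder(rows).zipWithIndex.toMap
    val root = new Node("root")
    for (row <- rows) {
      var node = root
      for (item <- row.sortBy(rank)) {
        node = node.children.getOrElseUpdate(item, new Node(item))
        node.count += 1
      }
    }
    root
  }

  // Render the tree as indented "item: count" lines.
  def render(node: Node, depth: Int = 0): Seq[String] = {
    val line = if (depth == 0) node.item else ("  " * depth) + node.item + ": " + node.count
    line +: node.children.values.toSeq.flatMap(child => render(child, depth + 1))
  }

  def main(args: Array[String]): Unit = render(build(rows)).foreach(println)
}
```

Rows that share a prefix share nodes, so the tree stays much smaller than the row set when a few items dominate.
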
  24. FP Results: c 6
  25-28. (Same figure on each slide.) Header Table: a 8, b 7, c 6. FP-tree restricted to a, b, c:
      root
        a: 8
          b: 5
            c: 3
          c: 1
        b: 2
          c: 2
      Paths ending in c: a, b, c 3; a, c 1; b, c 2. FP Results: c 6
  29-30. (Same figure on both slides.) FP-Tree | c, built from the prefixes a, b 3; a 1; b 2. Header Table: b 5, a 4. Tree nodes: root, b: 5, a: 4. FP Results: c 6
  31. Slide 29, plus the patterns read from FP-Tree | c: a 4; b 5; a, b 4
  32. Slide 31, with the FP Results extended to: c 6; a, c 4; b, c 5; a, b, c 4
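
The step from the full tree to "FP-Tree | c" can be sketched without a tree at all: collect, for every row containing c, the items that precede c in header order (the conditional pattern base), then sum the prefix counts to get the support of each pattern ending in c. The names below are illustrative. Counted this way over the ten rows, {a, b, c} gets a support of 3 (rows u-5, u-6, u-8), which agrees with the "a, b 3" prefix entry, whereas slides 31 and 32 show 4 for that pattern.

```scala
object ConditionalPatterns {
  // Header table order from the slides.
  val header: Seq[String] = Seq("a", "b", "c", "d", "e")

  // The ten user rows from the slides.
  val rows: Seq[Set[String]] = Seq(
    Set("a", "b"), Set("b", "c", "d"), Set("a", "c", "d", "e"), Set("a", "d", "e"),
    Set("a", "b", "c"), Set("a", "b", "c", "d"), Set("a"), Set("a", "b", "c"),
    Set("a", "b", "d"), Set("b", "c", "e")
  )

  // Conditional pattern base of `item`: for each row containing it, the items that
  // precede it in header order, with the number of rows sharing that prefix.
  def patternBase(item: String): Map[Seq[String], Int] = {
    val before = header.takeWhile(_ != item)
    rows.filter(_.contains(item))
      .map(row => before.filter(row.contains))
      .groupBy(identity)
      .map { case (prefix, same) => prefix -> same.size }
  }

  // Support of every pattern ending in `item`: sum the counts of the prefixes
  // that contain the rest of the pattern.
  def patternsEndingIn(item: String): Map[Set[String], Int] = {
    val base = patternBase(item)
    val prefixItems: Set[String] = base.keys.flatten.toSet
    prefixItems.subsets().map { rest =>
      (rest + item) -> base.iterator.collect {
        case (prefix, n) if rest.subsetOf(prefix.toSet) => n
      }.sum
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    println(patternBase("c"))
    println(patternsEndingIn("c"))
  }
}
```
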
  33. Scaling Up (Image: https://www.firestock.ru/strela-na-grafike-arrow-on-the-chart/)
  34. User table from slide 8 next to the Header Table (a 8, b 7, c 6, d 5, e 3) and the FP-tree from slide 18
  35-36. Chart of number of rows against number of items
  37-44. (Same figure on each slide.) Header Table and FP-tree from slide 18
  45. Header Table (a 8, b 7, c 6, d 5, e 3) next to the user table from slide 8
  46. Slide 45, plus one item prefix per Header Table entry:
      a
      [a], b
      [a, b], c
      [a, b, c], d
      [a, b, c, d], e
  47. Slide 46, plus the rows that contain c (items after c in header order are in parentheses):
      u-2   b, c, (d)
      u-3   a, c, (d, e)
      u-5   a, b, c
      u-6   a, b, c, (d)
      u-8   a, b, c
      u-10  b, c, (e)
  48. Slide 47, plus the rows that contain e:
      u-3   a, c, d, e
      u-4   a, d, e
      u-10  b, c, e
  49. Slide 48, plus the rows that contain d (items after d in header order are in parentheses):
      u-2   b, c, d
      u-3   a, c, d, (e)
      u-4   a, d, (e)
      u-6   a, b, c, d
      u-9   a, b, d
  50. Slide 49, plus the rows that contain b (items after b in header order are in parentheses):
      u-1   a, b
      u-2   b, (c, d)
      u-5   a, b, (c)
      u-6   a, b, (c, d)
      u-8   a, b, (c)
      u-9   a, b, (d)
      u-10  b, (c, e)
  51. Chart of number of rows against number of items, with the four row groups from slides 47-50 (b, c, d, e)
  52. Steps:
      1. Build FP-tree header table
      2. Distribute rows to executors
      3. Build FP-Trees on each node and mine for patterns
      4. Collect patterns
  53. The steps from slide 52, with code:
      // Build FP-tree header table: count each item, keep the frequent ones, sort
      val headerTable = data
        .flatMap(_.items.map(_ -> 1L))
        .reduceByKey(_ + _)
        .filter(isFrequent)
        .collect
        .sorted

      // Distribute rows to executors, keyed by item
      data
        .flatMap(filterDataBasedHeaderTable(headerTable))
        .groupByKey
        // Build FP-Trees on each node and mine for patterns
        .flatMap { case (k, rows) => mineForPatternsFor(k, rows) }
        // Collect patterns
        .collect // If necessary
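
The bodies of `isFrequent`, `filterDataBasedHeaderTable`, and `mineForPatternsFor` are not shown on slide 53. The sketch below is one possible reading of the pipeline, run on plain Scala collections instead of a Spark RDD so that it is self-contained: `groupMapReduce` stands in for `reduceByKey` and `groupBy` for `groupByKey`. All three helper bodies, the `Row` type, and the threshold of 4 are assumptions, and the mining step counts patterns by brute force rather than building an FP-tree per key.

```scala
object DistributedFpSketch {
  final case class Row(user: String, items: Set[String])

  // The ten user rows from the slides.
  val data: Seq[Row] = Seq(
    Row("u-1", Set("a", "b")), Row("u-2", Set("b", "c", "d")),
    Row("u-3", Set("a", "c", "d", "e")), Row("u-4", Set("a", "d", "e")),
    Row("u-5", Set("a", "b", "c")), Row("u-6", Set("a", "b", "c", "d")),
    Row("u-7", Set("a")), Row("u-8", Set("a", "b", "c")),
    Row("u-9", Set("a", "b", "d")), Row("u-10", Set("b", "c", "e"))
  )

  val minSupport = 4 // assumed threshold, as on slide 11

  // Assumed body: an item is frequent when its count reaches minSupport.
  def isFrequent(entry: (String, Long)): Boolean = entry._2 >= minSupport

  // Header table: frequent items, most frequent first.
  val headerTable: Seq[String] = data
    .flatMap(_.items.toSeq.map(item => item -> 1L))
    .groupMapReduce(_._1)(_._2)(_ + _) // local stand-in for reduceByKey
    .toSeq
    .filter(isFrequent)
    .sortBy { case (item, n) => (-n, item) }
    .map(_._1)

  // Assumed body: for each frequent item in a row, emit the item as key together
  // with the row's items up to and including it, in header order.
  def filterDataBasedHeaderTable(header: Seq[String])(row: Row): Seq[(String, Seq[String])] = {
    val sorted = header.filter(row.items.contains)
    sorted.indices.map(i => sorted(i) -> sorted.take(i + 1))
  }

  // Assumed body: count every pattern that contains the key item (brute force, no FP-tree).
  def mineForPatternsFor(key: String, rows: Seq[Seq[String]]): Seq[(Set[String], Int)] = {
    val all: Set[String] = rows.flatten.toSet
    (all - key).subsets().map(_ + key)
      .map(p => p -> rows.count(r => p.subsetOf(r.toSet)))
      .filter(_._2 >= minSupport)
      .toSeq
  }

  val patterns: Map[Set[String], Int] = data
    .flatMap(row => filterDataBasedHeaderTable(headerTable)(row))
    .groupBy(_._1) // local stand-in for groupByKey
    .toSeq
    .flatMap { case (k, keyed) => mineForPatternsFor(k, keyed.map(_._2)) }
    .toMap

  def main(args: Array[String]): Unit = patterns.foreach(println)
}
```

Because each key only receives the prefix of a row up to its own item, every pattern is counted in exactly one group (the group of its last item in header order), so the groups can be mined independently.
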
  54. Minimum support (Image: https://www.maxpixel.net/static/photo/1x/Cogs-Gears-Technical-Wheel-Cogwheel-Gearwheel-2279289.jpg)
  55. Differential Minimum Support (DMS)
      1. Classify Items Into Categories
      2. Compute Min Support Per Category
      3. Run FP with Multiple Min Supports
  56. COMMON ITEMS, RARE ITEMS
  57. Pattern Frequency Test
      CONDITION 1: Pattern Support ≥ Pattern Min Support. Pattern min support is defined as the lowest category minsup, given all items in the pattern
      CONDITION 2 - Apriori Principle (Recursive): If a pattern is frequent, all sub-patterns must be frequent
  58. Pattern Frequency Test. Condition 1: Pattern Support > Pattern Minimum Support
      Item  Cat     Minsup
      A     Common  100k
      B     Common  100k
      C     Rare    1k

      Pattern  Support  Minsup  Condition 1
      A B      80k
      A C      4k
      B C      3k
      A B C    2k
  59. Slide 58, with the pattern minsup filled in:
      Pattern  Support  Minsup  Condition 1
      A B      80k      100k
      A C      4k       1k
      B C      3k       1k
      A B C    2k       1k
  60. Slide 59, with Condition 1 restated: Pattern support > Lowest minsup given all items in the pattern
  61. Pattern Frequency Test. Condition 2 - A priori principle. Same tables as slide 59, with the last column of the pattern table headed Condition 2
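
The two conditions can be checked mechanically on the A, B, C example. In the sketch below the pattern supports and category thresholds are the slide values; the single-item supports are not on the slides and are made up so that each item passes its own threshold. Condition 1 is written with ≥ as on slide 57; with these numbers a strict > gives the same outcome. All names are illustrative.

```scala
object DifferentialMinSupport {
  // Category minimum support per item, from slides 58-61 (Common: 100k, Rare: 1k).
  val itemMinsup: Map[String, Long] = Map("A" -> 100000L, "B" -> 100000L, "C" -> 1000L)

  // Multi-item supports are from the slides. Single-item supports are HYPOTHETICAL.
  val support: Map[Set[String], Long] = Map(
    Set("A") -> 500000L, Set("B") -> 300000L, Set("C") -> 5000L, // hypothetical values
    Set("A", "B") -> 80000L, Set("A", "C") -> 4000L, Set("B", "C") -> 3000L,
    Set("A", "B", "C") -> 2000L
  )

  // Pattern min support: the lowest category minsup among the pattern's items.
  def patternMinsup(p: Set[String]): Long = p.map(itemMinsup).min

  // Condition 1: pattern support >= pattern min support.
  def condition1(p: Set[String]): Boolean = support.get(p).exists(_ >= patternMinsup(p))

  // Condition 2 (recursive a-priori): every non-empty proper sub-pattern must be frequent too.
  // Checking the sub-patterns one item smaller recursively covers all smaller ones.
  def isFrequent(p: Set[String]): Boolean =
    condition1(p) && p.subsets(p.size - 1).filter(_.nonEmpty).forall(isFrequent)

  def main(args: Array[String]): Unit =
    support.keys.toSeq.sortBy(_.size).foreach { p =>
      println(s"${p.toSeq.sorted.mkString(" ")}: condition1=${condition1(p)}, frequent=${isFrequent(p)}")
    }
}
```

With these inputs A B fails Condition 1 (80k against a 100k threshold), and A B C passes Condition 1 on its own (2k against 1k) but is rejected by Condition 2 because its sub-pattern A B is not frequent.
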
  62. val fpTreeResults = data
        .flatMap(filterDataBasedHeaderTable(headerTable))
        .groupByKey
        .flatMap { case (k, rows) => mineForPatternsFor(k, rows) }
  63. // CONDITION 1: pass the per-category minimum supports to the mining step
      val catMinsupMap = sc.broadcast(computeCatMinSup(data))
      val fpTreeResults = data
        .flatMap(filterDataBasedHeaderTable(headerTable))
        .groupByKey
        .flatMap { case (k, rows) =>
          mineForPatternsFor(k, rows, catMinsupMap.value)
        }
  64. The code from slide 63 (CONDITION 1), followed by:
      // CONDITION 2: keep a pattern only if all of its non-empty sub-patterns were found frequent
      // (assumes each pattern is a Set of items)
      val patternsMap = sc.broadcast(fpTreeResults.keys.collect.toSet)
      fpTreeResults
        .filter { case (pattern, support) =>
          pattern.subsets.filter(_.nonEmpty).forall(patternsMap.value.contains)
        }
  65. Not the end of the story ... (Image: https://w-dog.net/wallpaper/nature-night-star-tree-trees-stars-background-wallpaper-widescreen-full-screen-hd-wallpapers-fullscreen/id/308950/)
  66. Low Level Optimization
      • Handled case where array length > Integer.MAX_VALUE
      Result Set Compaction
      • Remove redundant and noisy result sets
      • Very efficient compaction - 95% without loss of information
      Result Set Ranking
      • Score patterns with multiple criteria
      Items with Feature Set
      • Not only which combinations work best, but what makes them work best
      • Well received feature, direct feedback on strategy
