2. WHAT IS ASSOCIATION MINING?
Discovering relationships between records, or groups of records, in a database.
Association Rule Mining
Algorithms for scalable mining of (single-dimensional Boolean) association rules in
transactional databases
Objective
Search for interesting relationships among items in a given data set
Association Rule
Antecedent → Consequent
Example:
{Diaper} → {Beer}
7. HOW ARE ASSOCIATION RULES MINED?
This is a two-step process:
1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy both minimum support and minimum confidence.
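The second step can be sketched as follows. This is a minimal illustration, assuming step 1 has already produced frequent itemsets with their support counts; the itemsets and counts below are made-up example data, not from the text.

```python
from itertools import combinations

# Assumed output of step 1: frequent itemsets and their support counts
# (hypothetical values for illustration).
support_count = {
    frozenset({"A"}): 3,
    frozenset({"C"}): 2,
    frozenset({"A", "C"}): 2,
}
min_conf = 0.6

# Step 2: generate strong rules X -> Y from each frequent itemset.
rules = []
for itemset, count in support_count.items():
    if len(itemset) < 2:
        continue  # a rule needs a non-empty antecedent and consequent
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            # confidence(X -> Y) = support(X ∪ Y) / support(X)
            conf = count / support_count[antecedent]
            if conf >= min_conf:  # keep only strong rules
                rules.append((set(antecedent), set(consequent), conf))

for x, y, c in rules:
    print(f"{x} -> {y}  confidence = {c:.0%}")
```

With these counts, both {A} → {C} and {C} → {A} pass the 60% confidence threshold.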
18. Steps in Mining Association Rules
There are two main steps:
1) Find all frequent itemsets
That is, first find all itemsets that occur frequently; each must have support at least the specified minimum support, called Min_sup.
2) Generate strong association rules from the frequent itemsets
That is, find the strong association rules; each rule must satisfy both the specified minimum support (Min_sup) and the specified minimum confidence, called Min_conf.
19. Basic Concepts for Mining Association Rules
An itemset is a set of items.
Let I = {i1, …, ik} be the set of all items.
An association rule is X → Y, where X ⊂ I and Y ⊂ I.
Find all rules X → Y that satisfy minimum confidence and minimum support; these are called strong association rules.
Support, s: the probability that a transaction contains X ∪ Y.
Confidence, c: the conditional probability that a transaction containing X also contains Y.
support(X → Y) = P(X ∪ Y)
confidence(X → Y) = P(Y|X) = P(X ∪ Y) / P(X)
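These definitions can be computed directly from a list of transactions. The sketch below uses the four-transaction toy data that also appears in the example that follows.

```python
# Computing support and confidence from their definitions.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset):
    """P(itemset): fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """P(Y|X) = P(X ∪ Y) / P(X)."""
    return support(x | y) / support(x)

print(support({"A", "C"}))       # 0.5
print(confidence({"A"}, {"C"}))  # 0.666...
print(confidence({"C"}, {"A"}))  # 1.0
```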
20. Example
Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

With Min_sup = 50% and Min_conf = 60%:

Frequent pattern | Support
{A}    | 75%
{B}    | 50%
{C}    | 50%
{A, C} | 50%

From the frequent itemset {A, C}:
A → C, support = 50%
C → A, support = 50%
confidence of A → C = P(A ∪ C)/P(A) = (1/2) / (3/4) = 2/3 ≈ 66.67%
confidence of C → A = P(C ∪ A)/P(C) = (1/2) / (1/2) = 100%
C → A is an exact rule (confidence = 100%).
22. THE APRIORI METHOD
Step 1: Find all frequent itemsets
That is, first find all itemsets that occur frequently, level by level:
1-itemsets (a, b, c), then 2-itemsets (ab, bc, …), then 3-itemsets (abc), and so on up to k-itemsets.
23. THE APRIORI METHOD
Step 1: Find all frequent itemsets
That is, first find all frequent itemsets; but if an itemset does not reach min_sup, then any itemset formed by combining it with other items cannot reach min_sup either, so it is pruned.
For example, with a min_sup threshold of 50%, if support({c}) = 30%, then {c} and every itemset containing c (bc, abc, …) are pruned.
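The level-wise search with pruning described above can be sketched as follows. The transactions are assumed toy data; the 50% threshold matches the slide.

```python
from itertools import combinations

# Assumed toy transactions; min_sup = 50% as in the slide above.
transactions = [{"a", "b"}, {"a", "b"}, {"b", "c"}, {"a"}, {"a", "b"}]
min_sup = 0.5

def sup(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent 1-itemsets (here {c} falls below min_sup and is pruned,
# so no superset of {c} is ever counted).
frequent = {frozenset([i]) for t in transactions for i in t
            if sup(frozenset([i])) >= min_sup}
all_frequent = set(frequent)
k = 2
while frequent:
    # Join step: build candidate k-itemsets from frequent (k-1)-itemsets.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune step (Apriori property): every (k-1)-subset must be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in all_frequent
                         for s in combinations(c, k - 1))}
    frequent = {c for c in candidates if sup(c) >= min_sup}
    all_frequent |= frequent
    k += 1

print(sorted(sorted(s) for s in all_frequent))  # [['a'], ['a', 'b'], ['b']]
```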
32. PREDICT
                       Predicted Label
                       Positive             Negative
Known Label  Positive  True Positive (TP)   False Negative (FN)
             Negative  False Positive (FP)  True Negative (TN)
For simplicity, the assumption is that each instance can only be assigned one of two classes: Positive or Negative (e.g. a patient's tumor may be malignant or benign). Each instance (e.g. a patient) has a Known label and a Predicted label. Some method (e.g. cross-validation) is used to make predictions on each instance. Each instance then increments one cell in the confusion matrix.
33. PREDICT
Measure              | Formula                         | Intuitive Meaning
Precision            | TP / (TP + FP)                  | The percentage of positive predictions that are correct.
Recall / Sensitivity | TP / (TP + FN)                  | The percentage of positively labeled instances that were predicted as positive.
Specificity          | TN / (TN + FP)                  | The percentage of negatively labeled instances that were predicted as negative.
Accuracy             | (TP + TN) / (TP + TN + FP + FN) | The percentage of predictions that are correct.
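The four measures in the table above translate directly into code. The counts passed in below are hypothetical, for illustration only.

```python
def metrics(tp, fn, fp, tn):
    """The four confusion-matrix measures from the table above."""
    return {
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical counts for a small two-class problem.
print(metrics(tp=8, fn=2, fp=1, tn=9))
```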
34. PREDICT
This seems to suggest that, without any knowledge of the distribution of
data, the best measures to use are Recall (Sensitivity) and Specificity to
allow one to find problems with classifiers. However, many other cases
can arise other than these four boundary cases. Consider the following
confusion matrix for a data set with 600 out of 11,100 instances positive:
                       Predicted Label
                       Positive  Negative
Known Label  Positive  500       100
             Negative  500       10,000
36. In this case, Precision = 50%, Recall = 83%, Specificity = 95%, and Accuracy = 95%. Precision is low, which means the classifier is predicting positives poorly. However, the three other measures suggest that this is a good classifier. This goes to show that the problem domain has a major impact on the measures that should be used to evaluate a classifier within it, and that looking at the four simple cases presented is not sufficient.
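The percentages above can be verified from the confusion matrix (TP = 500, FN = 100, FP = 500, TN = 10,000):

```python
# Reproducing the measures for the 11,100-instance confusion matrix above.
tp, fn, fp, tn = 500, 100, 500, 10_000

precision   = tp / (tp + fp)                 # 500 / 1,000   = 50%
recall      = tp / (tp + fn)                 # 500 / 600     ≈ 83%
specificity = tn / (tn + fp)                 # 10,000 / 10,500 ≈ 95%
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # 10,500 / 11,100 ≈ 95%

print(f"{precision:.0%} {recall:.0%} {specificity:.0%} {accuracy:.0%}")
# prints "50% 83% 95% 95%"
```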
38. CHAPTER 4 EXERCISES
What is Association Mining?
What is the objective of Association Mining?
How many types of association rules are there?
How many steps are there in Association Mining?
39. CHAPTER 4 EXERCISES
Consider the database in the figure below and assume the minimum support is 3 transactions.
41. LAB 4
Apriori works with categorical values only. Therefore, if a dataset contains numeric attributes, they need to be converted to nominal before applying the Apriori algorithm; hence, data preprocessing must be performed first. Repeat LAB 3 (Data Preprocessing) if you don't know how to handle the numeric-to-nominal conversion.
weather.nominal.arff
bank-data.arff
market-basket.arff
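The numeric-to-nominal conversion is done in WEKA with its preprocessing filters, as in LAB 3; the idea itself can be sketched in a few lines. The bin edges and labels below are assumptions for illustration, not values from the lab datasets.

```python
# Minimal sketch of discretization: mapping a numeric value to a nominal
# label using fixed bin edges (edges and labels are hypothetical).
def discretize(value, edges=(20, 40), labels=("low", "medium", "high")):
    """Return the label of the first bin whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # value is above all edges

ages = [18, 35, 52]
print([discretize(a) for a in ages])  # ['low', 'medium', 'high']
```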