2. WHAT IS ASSOCIATION MINING?
Discovering relationships between records, or groups of records, in a database.
Association Rule Mining
Algorithms for scalable mining of (single-dimensional Boolean) association rules in
transactional databases
Objective
Search for interesting relationships among items in a given data set
Association Rule
Antecedent → Consequent
Example:
{Diaper} → {Beer}
7. HOW ARE ASSOCIATION RULES MINED?
This is a two-step process:
1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy both minimum support and minimum confidence.
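The second step can be sketched as follows. This is a minimal illustration, assuming step 1 has already produced frequent itemsets with their support counts; the itemsets and counts below are made-up example data, not from the text.

```python
from itertools import combinations

# Assumed output of step 1: frequent itemsets and their support counts
# (hypothetical values for illustration).
support_count = {
    frozenset({"A"}): 3,
    frozenset({"C"}): 2,
    frozenset({"A", "C"}): 2,
}
min_conf = 0.6

# Step 2: generate strong rules X -> Y from each frequent itemset.
rules = []
for itemset, count in support_count.items():
    if len(itemset) < 2:
        continue  # a rule needs a non-empty antecedent and consequent
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            # confidence(X -> Y) = support(X ∪ Y) / support(X)
            conf = count / support_count[antecedent]
            if conf >= min_conf:  # keep only strong rules
                rules.append((set(antecedent), set(consequent), conf))

for x, y, c in rules:
    print(f"{x} -> {y}  confidence = {c:.0%}")
```

With these counts, both {A} → {C} and {C} → {A} pass the 60% confidence threshold.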
18. Steps in Mining Association Rules
There are two main steps:
1) Find all frequent itemsets
That is, first find all itemsets that occur frequently; each must have support at least the specified minimum support, called Min_sup.
2) Generate strong association rules from the frequent itemsets
That is, find the strong association rules; each rule must satisfy both the specified minimum support (Min_sup) and the specified minimum confidence, called Min_conf.
19. Basic Concepts for Mining Association Rules
An itemset is a set of items.
Let I = {i1, …, ik} be the set of all items.
An association rule is X → Y, where X ⊂ I and Y ⊂ I.
Find all rules X → Y that satisfy minimum confidence and minimum support; these are called strong association rules.
Support, s: the probability that a transaction contains X ∪ Y.
Confidence, c: the conditional probability that a transaction containing X also contains Y.
support(X → Y) = P(X ∪ Y)
confidence(X → Y) = P(Y|X) = P(X ∪ Y) / P(X)
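These definitions can be computed directly from a list of transactions. The sketch below uses the four-transaction toy data that also appears in the example that follows.

```python
# Computing support and confidence from their definitions.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset):
    """P(itemset): fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """P(Y|X) = P(X ∪ Y) / P(X)."""
    return support(x | y) / support(x)

print(support({"A", "C"}))       # 0.5
print(confidence({"A"}, {"C"}))  # 0.666...
print(confidence({"C"}, {"A"}))  # 1.0
```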
20. Example
Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

With Min_sup = 50% and Min_conf = 60%:

Frequent pattern | Support
{A}    | 75%
{B}    | 50%
{C}    | 50%
{A, C} | 50%

From the frequent itemset {A, C}:
A → C, support = 50%
C → A, support = 50%
confidence of A → C = P(A ∪ C)/P(A) = (1/2) / (3/4) = 2/3 ≈ 66.67%
confidence of C → A = P(C ∪ A)/P(C) = (1/2) / (1/2) = 100%
C → A is an exact rule (confidence = 100%).
22. THE APRIORI METHOD
Step 1: Find all frequent itemsets
That is, first find all itemsets that occur frequently, level by level:
1-itemsets (a, b, c), then 2-itemsets (ab, bc, …), then 3-itemsets (abc), and so on up to k-itemsets.
23. THE APRIORI METHOD
Step 1: Find all frequent itemsets
That is, first find all frequent itemsets; but if an itemset does not reach min_sup, then any itemset formed by combining it with other items cannot reach min_sup either, so it is pruned.
For example, with a min_sup threshold of 50%, if support({c}) = 30%, then {c} and every itemset containing c (bc, abc, …) are pruned.
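The level-wise search with pruning described above can be sketched as follows. The transactions are assumed toy data; the 50% threshold matches the slide.

```python
from itertools import combinations

# Assumed toy transactions; min_sup = 50% as in the slide above.
transactions = [{"a", "b"}, {"a", "b"}, {"b", "c"}, {"a"}, {"a", "b"}]
min_sup = 0.5

def sup(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent 1-itemsets (here {c} falls below min_sup and is pruned,
# so no superset of {c} is ever counted).
frequent = {frozenset([i]) for t in transactions for i in t
            if sup(frozenset([i])) >= min_sup}
all_frequent = set(frequent)
k = 2
while frequent:
    # Join step: build candidate k-itemsets from frequent (k-1)-itemsets.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune step (Apriori property): every (k-1)-subset must be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in all_frequent
                         for s in combinations(c, k - 1))}
    frequent = {c for c in candidates if sup(c) >= min_sup}
    all_frequent |= frequent
    k += 1

print(sorted(sorted(s) for s in all_frequent))  # [['a'], ['a', 'b'], ['b']]
```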
32. PREDICT
                       Predicted Label
                       Positive             Negative
Known Label  Positive  True Positive (TP)   False Negative (FN)
             Negative  False Positive (FP)  True Negative (TN)
For simplicity, the assumption is that each instance can only be assigned one of two classes: Positive or Negative (e.g. a patient's tumor may be malignant or benign). Each instance (e.g. a patient) has a Known label and a Predicted label. Some method (e.g. cross-validation) is used to make predictions on each instance. Each instance then increments one cell in the confusion matrix.
33. PREDICT
Measure              | Formula                         | Intuitive Meaning
Precision            | TP / (TP + FP)                  | The percentage of positive predictions that are correct.
Recall / Sensitivity | TP / (TP + FN)                  | The percentage of positively labeled instances that were predicted as positive.
Specificity          | TN / (TN + FP)                  | The percentage of negatively labeled instances that were predicted as negative.
Accuracy             | (TP + TN) / (TP + TN + FP + FN) | The percentage of predictions that are correct.
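The four measures in the table above translate directly into code. The counts passed in below are hypothetical, for illustration only.

```python
def metrics(tp, fn, fp, tn):
    """The four confusion-matrix measures from the table above."""
    return {
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical counts for a small two-class problem.
print(metrics(tp=8, fn=2, fp=1, tn=9))
```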
34. PREDICT
This seems to suggest that, without any knowledge of the distribution of
data, the best measures to use are Recall (Sensitivity) and Specificity to
allow one to find problems with classifiers. However, many other cases
can arise other than these four boundary cases. Consider the following
confusion matrix for a data set with 600 out of 11,100 instances positive:
                       Predicted Label
                       Positive  Negative
Known Label  Positive  500       100
             Negative  500       10,000
36. In this case, Precision = 50%, Recall = 83%, Specificity = 95%, and Accuracy = 95%. Precision is low, which means the classifier is predicting positives poorly. However, the three other measures suggest that this is a good classifier. This goes to show that the problem domain has a major impact on the measures that should be used to evaluate a classifier within it, and that looking at the four simple cases presented is not sufficient.
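The percentages above can be verified from the confusion matrix (TP = 500, FN = 100, FP = 500, TN = 10,000):

```python
# Reproducing the measures for the 11,100-instance confusion matrix above.
tp, fn, fp, tn = 500, 100, 500, 10_000

precision   = tp / (tp + fp)                 # 500 / 1,000   = 50%
recall      = tp / (tp + fn)                 # 500 / 600     ≈ 83%
specificity = tn / (tn + fp)                 # 10,000 / 10,500 ≈ 95%
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # 10,500 / 11,100 ≈ 95%

print(f"{precision:.0%} {recall:.0%} {specificity:.0%} {accuracy:.0%}")
# prints "50% 83% 95% 95%"
```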
38. CHAPTER 4 EXERCISES
What is Association Mining?
What is the objective of Association Mining?
How many types of association rules are there?
How many steps are there in Association Mining?
39. CHAPTER 4 EXERCISES
Consider the database in the figure below and assume the minimum support is 3 transactions.
41. LAB 4
Apriori works with categorical values only. Therefore, if a dataset contains numeric attributes, they need to be converted to nominal before applying the Apriori algorithm; hence, data preprocessing must be performed first. Repeat LAB 3 (Data Preprocessing) if you don't know how to handle the numeric-to-nominal conversion.
weather.nominal.arff
bank-data.arff
market-basket.arff
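The numeric-to-nominal conversion is done in WEKA with its preprocessing filters, as in LAB 3; the idea itself can be sketched in a few lines. The bin edges and labels below are assumptions for illustration, not values from the lab datasets.

```python
# Minimal sketch of discretization: mapping a numeric value to a nominal
# label using fixed bin edges (edges and labels are hypothetical).
def discretize(value, edges=(20, 40), labels=("low", "medium", "high")):
    """Return the label of the first bin whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # value is above all edges

ages = [18, 35, 52]
print([discretize(a) for a in ages])  # ['low', 'medium', 'high']
```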