Association discovery is an unsupervised learning technique that finds correlations or "if-then" rules in data without labels. It outputs rules with antecedents and consequents along with significance metrics like support, coverage, confidence, lift and leverage. Configuration options include the search strategy, minimum rule measures, handling of complementary/missing values, consequent filtering, and discretization of numeric data.
4. BigML Education Program 4Association Discovery
What is Association Discovery?
• An unsupervised learning technique
• No labels necessary
• Useful for data discovery
• Finds "significant" correlations
• Shopping cart: Coffee and sugar
• Medical: High plasma glucose and diabetes
• Expresses them as "if then rules"
• If "antecedent" then "consequent"
• Significance measures
5. BigML Education Program 5Association Discovery
Association Rules
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
zip = 46140
amount < 100
Rules:
Antecedent Consequent
{customer = Bob, account = 3421}
{class = gas}
7. BigML Education Program 7Association Discovery
Association Metrics
Coverage
Percentage of instances
which match antecedent “A”
Instances
A
C
8. BigML Education Program 8Association Discovery
Association Metrics
Support
Percentage of instances
which match antecedent
“A” and Consequent “C”
Instances
A
C
9. BigML Education Program 9Association Discovery
Association Metrics
Confidence
Percentage of instances in
the antecedent which also
contain the consequent.
Coverage
Support
Instances
A
C
10. BigML Education Program 10Association Discovery
Association Metrics
C
Instances
A
C
A
Instances
C
Instances
A
Instances
A
C
0% 100%
Instances
A
C
Confidence
A never
implies C
A sometimes
implies C
A always
implies C
11. BigML Education Program 11Association Discovery
Association Metrics
Lift
Ratio of observed support
to support if A and C were
statistically independent.
Support == Confidence
p(A) * p(C) p(C)
Independent
A
C
C
Observed
A
12. BigML Education Program 12Association Discovery
Association Metrics
C
Observed
A
Observed
A
C
< 1 > 1
Independent
A
C
Lift = 1
Negative
Correlation
No Correlation
Positive
Correlation
Independent
A
C
Independent
A
C
Observed
A
C
13. BigML Education Program 13Association Discovery
Association Metrics
Lift
Ratio of observed support
to support if A and C were
statistically independent.
Support == Confidence
p(A) * p(C) p(C)
Independent
A
C
C
Observed
A
Problem:
if p(C) is "small" then…
lift may be large.
14. BigML Education Program 14Association Discovery
Association Metrics
Leverage
Difference of observed
support and support if A
and C were statistically
independent.
Support - [ p(A) * p(C) ]
Independent
A
C
C
Observed
A
15. BigML Education Program 15Association Discovery
Association Metrics
C
Observed
A
Observed
A
C
< 0 > 0
Independent
A
C
Leverage = 0
Negative
Correlation
No Correlation
Positive
Correlation
Independent
A
C
Independent
A
C
Observed
A
C
-1…
16. BigML Education Program 16Association Discovery
AD Configuration
1. Max Number of Associations: 1 to 500 (default 100)
2. Max Items in Antecedent: 1 to 10 (default 4)
3. Search Strategy: Support / Coverage / Confidence / Lift / Leverage
4. Complement Items: True / False
• False: Coffee and…
• True: Not Coffee and…
5. Missing Items: True / False
• False: Loan Description contains "Ferrari" and…
• True: Loan Description is missing and…
17. BigML Education Program 17Association Discovery
Data Types
numeric
1 2 3
1, 2.0, 3, -5.4 categoricaltrue, yes, red, mammal categoricalcategorical
A B C
date-time2013-09-25 10:02
DATE-TIME
YEAR
MONTH
DAY-OF-MONTH
YYYY-MM-DD
DAY-OF-WEEK
HOUR
MINUTE
YYYY-MM-DD
YYYY-MM-DD
M-T-W-T-F-S-D
HH:MM:SS
HH:MM:SS
2013
September
25
Wednesday
10
02
text
Be not afraid of greatness:
some are born great, some
achieve greatness, and
some have greatness
thrust upon 'em.
text
“great”
“afraid”
“born”
“some”
appears 2 times
appears 1 time
appears 1 time
appears 2 times
18. BigML Education Program 18Association Discovery
Items Type
itemscoffee, sugar, milk, honey,
dish soap, bread
items
• Canonical example: shopping cart contents
• Single feature describing a list of items
• Each item separated by a comma (default)
20. BigML Education Program 20Association Discovery
AD Configuration
1. Measures: Set a minimum criteria for AD measures
2. Minimum Significance: lower values reduce spurious rules
3. Consequent: Restrict rules to a specific consequent criteria
4. Discretization: How numeric values are handled
• Pretty: rounds off discretized values: 20 instead of 20.234
• Size: the number of ranges (default 5)
• Type: equal population / width
• Trim: Removes percentage of values from the tails