September 8-9, 2016
BigML, Inc 2
Association Discovery
Geoff Webb
Professor of Information Technology Research
Monash University, Melbourne, Australia
Finding interesting correlations
Association Discovery
• Algorithm: “Magnum Opus” from Geoff Webb
• Unsupervised Learning: Works with unlabelled data, like clustering and anomaly detection.
• Learning Task: Find “interesting” relations between variables.
Unsupervised Learning
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51
Clustering: finds similar instances
Anomaly Detection: finds unusual instances
Association Rules
Rules (Antecedent → Consequent):
{class = gas} → amount < 100
{customer = Bob, account = 3421} → zip = 46140
Use Cases
• Market Basket Analysis
• Web usage patterns
• Intrusion detection
• Fraud detection
• Bioinformatics
• Medical risk factors
Magnum Opus
What's wrong with frequent pattern mining?
• Feast or famine
  • often results in too few or too many patterns
• The vodka and caviar problem
  • some high value patterns are infrequent
• Cannot handle dense data
• Minimum support may not be relevant
  • cannot be low enough to capture all valid rules
  • cannot be high enough to exclude all spurious rules
Very infrequent patterns can be significant
Data file: Brijs retail.itl, 88162 cases / 16470 items
237 → 1 [Coverage=3032; Support=28; Lift=3.06; p=1.99E-007]
237 & 4685 → 1 [Coverage=19; Support=9; Lift=157.00; p=5.03E-012]
1159 → 1 [Coverage=197; Support=9; Lift=15.14; p=1.13E-008]
4685 → 1 [Coverage=270; Support=9; Lift=11.05; p=1.68E-007]
168 → 1 [Coverage=293; Support=9; Lift=10.18; p=3.33E-007]
4382 → 1 [Coverage=72; Support=8; Lift=36.83; p=6.26E-011]
168 & 4685 → 1 [Coverage=9; Support=7; Lift=257.78; p=6.66E-011]
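The lift values above can be checked from the printed counts. The sketch below does so under one loud assumption: the slide does not say how many baskets contain item 1, so the count of 266 used here is inferred from confidence divided by lift and is not a figure from the source. Lift is taken as confidence (Support / Coverage) divided by the base rate of the consequent.

```python
N_CASES = 88162  # from the slide: number of baskets in retail.itl

# ASSUMPTION: item 1 appears in 266 baskets. The slide does not state this
# count; it is inferred from confidence / lift and is only illustrative.
ASSUMED_CONSEQUENT_COUNT = 266

# (antecedent, coverage count, support count, lift as printed on the slide)
REPORTED = [
    ("237", 3032, 28, 3.06),
    ("237 & 4685", 19, 9, 157.00),
    ("1159", 197, 9, 15.14),
    ("4685", 270, 9, 11.05),
    ("168", 293, 9, 10.18),
    ("4382", 72, 8, 36.83),
    ("168 & 4685", 9, 7, 257.78),
]


def lift(coverage, support, consequent_count, n_cases):
    """Lift = confidence / base rate of the consequent."""
    confidence = support / coverage
    base_rate = consequent_count / n_cases
    return confidence / base_rate


RECOMPUTED = {
    antecedent: lift(coverage, support, ASSUMED_CONSEQUENT_COUNT, N_CASES)
    for antecedent, coverage, support, _ in REPORTED
}

if __name__ == "__main__":
    for antecedent, _, _, printed in REPORTED:
        print(f"{antecedent} -> 1: slide {printed:.2f}, recomputed {RECOMPUTED[antecedent]:.2f}")
```

Under that assumption each recomputed lift agrees with the slide to within rounding, even though item 1 would appear in only about 0.3% of baskets.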
Very high support patterns can be spurious
Data file: covtype.data 581012 cases / 125 values
ST15=0 → ST07=0 [Coverage=581009; Support=580904; Confidence=1.000]
ST07=0 → ST15=0 [Coverage=580907; Support=580904; Confidence=1.000]
ST15=0 → ST36=0 [Coverage=581009; Support=580890; Confidence=1.000]
ST36=0 → ST15=0 [Coverage=580893; Support=580890; Confidence=1.000]
ST15=0 → ST08=0 [Coverage=581009; Support=580830; Confidence=1.000]
ST08=0 → ST15=0 [Coverage=580833; Support=580830; Confidence=1.000]
… 197,183,686 such rules have highest support
Magnum Opus
• User selects measure of interest
• System finds the top-k associations on that measure within constraints
• Must be statistically significant interaction between antecedent and consequent
• Every item in the antecedent must increase the strength of association
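The selection idea in the bullets above can be sketched in a few lines. This is not the Magnum Opus algorithm: it is a brute-force illustration restricted to single-item antecedents, with lift as the chosen measure and a one-sided Fisher exact test as the significance check (the choice of test, the `alpha` threshold, the function names, and the toy baskets are all assumptions made here). With single-item antecedents, the "every item must increase the strength" constraint has nothing to check, so it is omitted.

```python
from heapq import nlargest
from itertools import permutations
from math import comb


def fisher_upper_tail(n, n_a, n_c, n_ac):
    """One-sided p-value P(X >= n_ac) for a hypergeometric X.

    n rows in total, n_a match the antecedent, n_c match the consequent,
    n_ac match both.
    """
    upper = min(n_a, n_c)
    tail = sum(comb(n_c, k) * comb(n - n_c, n_a - k) for k in range(n_ac, upper + 1))
    return tail / comb(n, n_a)


def top_k_rules(baskets, k, alpha=0.05):
    """Return the k significant rules a -> c with the highest lift."""
    n = len(baskets)
    items = sorted(set().union(*baskets))
    count = {i: sum(1 for b in baskets if i in b) for i in items}
    candidates = []
    for a, c in permutations(items, 2):
        n_ac = sum(1 for b in baskets if a in b and c in b)
        if n_ac == 0:
            continue
        p_value = fisher_upper_tail(n, count[a], count[c], n_ac)
        if p_value > alpha:
            continue  # constraint: keep only statistically significant rules
        lift = (n_ac / n) / ((count[a] / n) * (count[c] / n))
        candidates.append((lift, p_value, a, c))
    return nlargest(k, candidates)


# Hypothetical toy baskets: an infrequent but strongly associated pair.
BASKETS = (
    [{"vodka", "caviar"}] * 3
    + [{"bread", "milk"}] * 5
    + [{"bread"}] * 4
)

if __name__ == "__main__":
    for lift, p_value, a, c in top_k_rules(BASKETS, k=5):
        print(f"{a} -> {c}: lift={lift:.2f}, p={p_value:.4f}")
```

Note that no minimum-support threshold appears anywhere: the rare vodka and caviar pair survives because it is significant, while the much more frequent bread and milk pair is dropped because it is not.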
Association Metrics
Coverage
Percentage of instances which match antecedent “A”
[Figure: set of all instances, with antecedent A and consequent C marked]
Support
Percentage of instances which match antecedent “A” and consequent “C”
Confidence
Percentage of instances in the antecedent which also contain the consequent.
Confidence = Support / Coverage
[Figure: confidence scale from 0% to 100%. At 0%, A never implies C; in between, A sometimes implies C; at 100%, A always implies C]
Lift
Ratio of observed support to support if A and C were statistically independent.
Lift = Support / (p(A) * p(C)) = Confidence / p(C)
[Figure: overlap of A and C, independent vs. observed]
[Figure: lift scale. Lift < 1: negative correlation; Lift = 1: no association; Lift > 1: positive correlation]
Leverage
Difference of observed support and support if A and C were statistically independent.
Leverage = Support - [ p(A) * p(C) ]
[Figure: overlap of A and C, independent vs. observed]
[Figure: leverage scale starting at -1…. Leverage < 0: negative correlation; Leverage = 0: no association; Leverage > 0: positive correlation]
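The five metrics can be computed directly on the seven-row transaction table from the earlier slides. The following is a minimal sketch, not BigML's implementation: the function name `rule_metrics`, the use of predicates for antecedent and consequent, and exact fractions instead of percentages are choices made here for illustration. The last row's date, printed as "The" in the table, is read as Thursday.

```python
from fractions import Fraction

COLUMNS = ["date", "customer", "account", "auth", "class", "zip", "amount"]
ROWS = [
    ("Mon", "Bob", 3421, "pin", "clothes", 46140, 135),
    ("Tue", "Bob", 3421, "sign", "food", 46140, 401),
    ("Tue", "Alice", 2456, "pin", "food", 12222, 234),
    ("Wed", "Sally", 6788, "pin", "gas", 26339, 94),
    ("Wed", "Bob", 3421, "pin", "tech", 21350, 2459),
    ("Wed", "Bob", 3421, "pin", "gas", 46140, 83),
    ("Thu", "Sally", 6788, "sign", "food", 26339, 51),  # "The" in the slide
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def rule_metrics(data, antecedent, consequent):
    """Metrics for the rule A -> C, where A and C are predicates on a row.

    Assumes A and C each match at least one row (otherwise a division by
    zero occurs).
    """
    n = len(data)
    n_a = sum(1 for r in data if antecedent(r))
    n_c = sum(1 for r in data if consequent(r))
    n_ac = sum(1 for r in data if antecedent(r) and consequent(r))
    coverage = Fraction(n_a, n)   # p(A)
    p_c = Fraction(n_c, n)        # p(C)
    support = Fraction(n_ac, n)   # p(A and C)
    return {
        "coverage": coverage,
        "support": support,
        "confidence": support / coverage,
        "lift": support / (coverage * p_c),
        "leverage": support - coverage * p_c,
    }


GAS_RULE = rule_metrics(
    DATA,
    antecedent=lambda r: r["class"] == "gas",
    consequent=lambda r: r["amount"] < 100,
)
BOB_RULE = rule_metrics(
    DATA,
    antecedent=lambda r: r["customer"] == "Bob" and r["account"] == 3421,
    consequent=lambda r: r["zip"] == 46140,
)

if __name__ == "__main__":
    for name, metrics in [("gas", GAS_RULE), ("bob", BOB_RULE)]:
        print(name, {k: str(v) for k, v in metrics.items()})
```

Both rules have lift above 1 and leverage above 0, so both sit on the "positive correlation" side of the scales, although only the gas rule has 100% confidence.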
Use Cases
GOAL: Discover “interesting” rules about what store items are typically purchased together.
• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at checkout
Association Discovery: Demo #1
Use Cases
GOAL: Find general rules that indicate diabetes.
• Dataset of diagnostic measurements of 768 patients.
• Each patient labelled True/False for diabetes.
Association Discovery: Demo #2
Medical Risks
Decision Tree
If plasma glucose > 155
and bmi > 29.32
and diabetes pedigree > 0.32
and insulin <= 629
and age <= 44
then diabetes = TRUE
Association Rule
If plasma glucose > 146
then diabetes = TRUE
Latent Dirichlet Allocation
#VSSML16
September 2016
#VSSML16 Latent Dirichlet Allocation September 2016 1 / 24
Outline
1 Understanding the Limits of Simple Text Analysis
2 Aside: Generative Processes
3 Latent Dirichlet Allocation
4 A Couple of Instructive Examples
5 Applications
1 Understanding the Limits of Simple Text Analysis
Bag of Words Analysis
• Easiest way of analyzing a text field is just to treat it as a “bag of words”
• Each word is a separate feature (usually an occurrence count)
• When modeling, the features are treated in isolation from one another, essentially “one at a time”
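As a concrete illustration of the bullets above, a bag of words can be built with a word counter. This is a minimal sketch: the tokenizer (lower-casing plus a simple regular expression) and the two example sentences are assumptions made here, not part of the talk.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split it into word tokens, and count occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical example documents.
DOCS = [
    "The CEO will lead the company.",
    "Lead is a heavy metal.",
]
BAGS = [bag_of_words(doc) for doc in DOCS]

if __name__ == "__main__":
    for doc, bag in zip(DOCS, BAGS):
        print(doc, dict(bag))
```

Each distinct word becomes one feature whose value is its count. Word order is discarded, so the feature "lead" has the same value in both documents even though the word means different things in each.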
Limitations
• Words are sometimes ambiguous
  • Both because of multiple definitions and difference in tone
• How do we usually disambiguate words? Context
An Instructive Example
• One way of looking at the usefulness of a machine learning feature is to think about how well it isolates unique and coherent subsets of the data
• Suppose I have a collection of documents where some of them are about two different topics (via Ted Underwood’s Blog):
  • Leadership (CEOs, organization, management)
  • Chemistry (Elements, compounds, reactions)
• If I do a keyword search for “lead” (or try to classify documents based on that word alone), I’ll get documents from either category and documents that are a mix of both
• Can we build a feature that better isolates which set of documents we’re looking for?
2 Aside: Generative Processes
Generative Modeling
• Posit a parameterized structure that is responsible for generating the data
• Use the data to fit the parameters
• A notion of causality is important for these models
Example of a Generative Model
• Consider a patient with some disease
• Class: Disease present / absent, Features: Test results
• Arrows indicate cause in this diagram; the symptoms (features) are caused by the disease
• This generative process implies a structure; in this case the so-called “Naive Bayes” model
Generative vs. Discriminative
• This is an important distinction in machine learning generally
• Generative models try to model / assume a structure for the process generating the data
• More mathematically, generative classifiers explicitly model the joint distribution p(x, y) of the data
• Discriminative models don’t care; they “solve the prediction problem directly”, and model only the conditional p(y|x) (Vapnik)
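For the Naive Bayes structure mentioned earlier, the two views can be written side by side. The notation is an assumption made here: y is the class (disease present or absent) and x = (x_1, …, x_d) are the features (test results), taken to be conditionally independent given the class.

```latex
% Generative view (Naive Bayes): model the joint distribution
p(x, y) = p(y) \prod_{i=1}^{d} p(x_i \mid y)

% A prediction is then obtained from the joint via Bayes' rule
p(y \mid x) = \frac{p(x, y)}{\sum_{y'} p(x, y')}
```

A discriminative model skips the first line and parameterizes p(y | x) directly, which is why it cannot be used to generate new x.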
Which is Better?
• No general answer to this question (not that we haven’t tried): Paper: On Discriminative vs. Generative Classifiers [1]
• Discriminative models tend to be faster to fit, quicker to predict, and in the case of non-parametrics are often guaranteed to converge to the correct answer given enough data
• Generative models tend to be more probabilistically sound and able to do more than just classify
[1] http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
3 Latent Dirichlet Allocation
A New Way of Thinking About Documents
• Three entities: Documents, Terms, and Topics
• A term is a single lexical token (usually one or more words, but can be any arbitrary string)
• A document has many terms
• A topic is a distribution over terms
A Generative Model for Documents
• A document can be thought of as a distribution over topics, drawn from a distribution over possible distributions
• To create a document, repeatedly draw a topic at random from the distribution, then draw a term from that topic (which, remember, is a distribution over terms)
• The main thing we want to infer is the topic distribution
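The generative story above can be sketched as a small sampler. Everything concrete here is hypothetical: the two topics and their term probabilities are made up for illustration (loosely following the leadership and chemistry example), the Dirichlet parameter `alpha` is an arbitrary choice, and the Dirichlet draw is implemented by normalising Gamma draws from the standard library.

```python
import random

# Hypothetical topics: each topic is a distribution over terms.
TOPICS = {
    "leadership": {"lead": 0.2, "ceo": 0.4, "management": 0.4},
    "chemistry": {"lead": 0.2, "element": 0.4, "reaction": 0.4},
}


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document: returns (topic distribution, list of terms)."""
    topic_names = list(topics)
    # The document's own distribution over topics, drawn from a
    # "distribution over possible distributions".
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Draw a topic for this position, then a term from that topic.
        topic = rng.choices(topic_names, weights=theta)[0]
        terms = list(topics[topic])
        weights = [topics[topic][t] for t in terms]
        words.append(rng.choices(terms, weights=weights)[0])
    return theta, words


if __name__ == "__main__":
    rng = random.Random(0)
    theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], length=10, rng=rng)
    print(dict(zip(TOPICS, theta)))
    print(" ".join(words))
```

Fitting the model runs this story in reverse: given only the words, infer the hidden per-document topic distribution (`theta` in the sketch) and the topics themselves.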
Dirichlet Process Intuition: Rich Get Richer
• We use a Dirichlet process to model the relationship between documents, topics, and terms
• We’re more likely to think a word came from a topic if we’ve already seen a bunch of words from that topic
• We’re more likely to think the topic was responsible for generating the document if we’ve already seen a bunch of words in the document from that topic.
• Here lies the disambiguation: If a word could have come from two different topics, we use the rest of the words in the document to decide which meaning it has
• Note that there’s a little bit of self-fulfilling prophecy going on here (by design)
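The "rich get richer" intuition can be made concrete with the per-word update used in collapsed Gibbs sampling, one common way to fit LDA. The slides do not say which inference method is used, so treat this as an assumed illustration: the counts, the smoothing values `alpha` and `beta`, and the function name are all hypothetical.

```python
def topic_posterior(word, doc_topic_counts, topic_word_counts, vocab_size,
                    alpha=0.1, beta=0.01):
    """Probability of each topic for one word, given the other assignments.

    score(topic) = (words in this document already assigned to the topic + alpha)
                 * (times this word was assigned to the topic + beta)
                 / (all words assigned to the topic + vocab_size * beta)
    """
    scores = {}
    for topic, word_counts in topic_word_counts.items():
        n_topic = sum(word_counts.values())
        doc_part = doc_topic_counts.get(topic, 0) + alpha
        word_part = (word_counts.get(word, 0) + beta) / (n_topic + vocab_size * beta)
        scores[topic] = doc_part * word_part
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}


# Hypothetical corpus-wide counts: "lead" is equally common in both topics.
TOPIC_WORD_COUNTS = {
    "leadership": {"lead": 10, "ceo": 20, "management": 20},
    "chemistry": {"lead": 10, "element": 20, "reaction": 20},
}
VOCAB_SIZE = 5

# A document whose other words were mostly assigned to "leadership" ...
BUSINESS_DOC = topic_posterior("lead", {"leadership": 8, "chemistry": 1},
                               TOPIC_WORD_COUNTS, VOCAB_SIZE)
# ... and one whose other words were mostly assigned to "chemistry".
SCIENCE_DOC = topic_posterior("lead", {"leadership": 1, "chemistry": 8},
                              TOPIC_WORD_COUNTS, VOCAB_SIZE)

if __name__ == "__main__":
    print(BUSINESS_DOC)
    print(SCIENCE_DOC)
```

Because "lead" is equally likely under both topics, the rest of the document decides: the topic that already owns most of the document's words gets most of the probability for the ambiguous one.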
4 A Couple of Instructive Examples
Usenet Movie Reviews
Library of over 26,000 movie reviews
A solid noir melodrama from Vincent Sherman, who takes a standard
story and dresses it up with moving characterizations and beautifully
expressionistic B&W photography from cinematographer James Wong Howe.
The director took a songwriter Paul Webster's short magazine story
called "The Man Who Died Twice" and improved the story by rounding out
the characters to give them both strong and weak points, so that they
would not be one-note characters as was the case in the original
story. The film was made by Warner Brothers, who needed a film for
their contract star Ann Sheridan and asked Sherman to change the story
around so that her part as Nora Prentiss, a nightclub singer, is
expanded
Supreme Court Cases
Library of about 7500 Supreme Court Cases
NO. 136. ARGUED DECEMBER 6, 1966. - DECIDED JANUARY 9, 1967. - 258 F.
SUPP. 819, REVERSED.
FOLLOWING THIS COURT'S DECISIONS IN SWANN V. ADAMS, INVALIDATING THE
APPORTIONMENT OF THE FLORIDA LEGISLATURE (378 U.S. 553) AND THE
SUBSEQUENT REAPPORTIONMENT WHICH THE DISTRICT COURT HAD FOUND
UNCONSTITUTIONAL BUT APPROVED ON AN INTERIM BASIS (383 U.S. 210), THE
FLORIDA LEGISLATURE ADOPTED STILL ANOTHER LEGISLATIVE REAPPORTIONMENT
PLAN, WHICH APPELLANTS, RESIDENTS AND VOTERS OF DADE COUNTY, FLORIDA,
ATTACKED AS FAILING TO MEET THE STANDARDS OF VOTER EQUALITY SET FORTH
5 Applications
Visualizing Changes in Topic Over Time
• Plot changes in topic distribution over time
• Especially nice for dated historical collections (e.g., novels, newspapers)
Search Without Keywords
• Keyword search is great, if you know the keywords
• Good for finding search terms
• Great for, e.g., legal discovery
• Nice for finding “outliers”
• Surprise topics (From the recycle bin)
Feature Spaces for Classification
• Just classify the documents in “topic space” rather than “bag space”
• The topics that come out of LDA have some nice benefits as features
  • Can reduce a feature space of thousands to a few dozen (faster to fit)
  • Nicely interpretable
  • Automatically tailored to the documents you’ve provided
• Foreshadowing Alert: When using LDA in this way, we’re doing a form of feature engineering which we’ll hear more about tomorrow.
Some Caveats
• You need to choose the number of topics beforehand
• Takes forever, both to fit and to do inference
• Takes a lot of text to make it meaningful
• Tends to focus on “meaningless minutiae”
• While it sometimes makes a nice classification space, it’s a rare case that provides dramatic improvement over bag-of-words
• I find it nice just for exploration
Thus Ends The Lesson
Questions?
VSSML16 L4. Association Discovery and Latent Dirichlet Allocation

  • 2. BigML, Inc 2 Association Discovery Geoff Webb Professor of Information Technology Research Monash University, Melbourne, Australia Finding interesting correlations
  • 3. BigML, Inc 3Unsupervised Learning • Algorithm: “Magnum Opus” from Geoff Webb • Unsupervised Learning: Works with unlabelled data, like clustering and anomaly detection. • Learning Task: Find “interesting” relations between variables. Association Discovery
  • 4. BigML, Inc 4Unsupervised Learning Unsupervised Learning date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 The Sally 6788 sign food 26339 51 date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 The Sally 6788 sign food 26339 51 Clustering Anomaly Detection similar unusual
  • 5. BigML, Inc 5Unsupervised Learning {class = gas} amount < 100 Association Rules date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 The Sally 6788 sign food 26339 51 {customer = Bob, account = 3421} zip = 46140 Rules: Antecedent Consequent
  • 6. BigML, Inc 6Unsupervised Learning Use Cases • Market Basket Analysis • Web usage patterns • Intrusion detection • Fraud detection • Bioinformatics • Medical risk factors
  • 7. BigML, Inc 7Unsupervised Learning Magnum Opus What's wrong with frequent pattern mining?
  • 8. BigML, Inc 8Unsupervised Learning Magnum Opus What's wrong with frequent pattern mining? • Feast or famine • often results in too few or too many patterns • The vodka and caviar problem • some high value patterns are infrequent • Cannot handle dense data • Minimum support may not be relevant • cannot be low enough to capture all valid rules • cannot be high enough to exclude all spurious rules
  • 9. BigML, Inc 9Unsupervised Learning Magnum Opus Very infrequent patterns can be significant Data file: Brijs retail.itl, 88162 cases / 16470 items 237 → 1 
 [Coverage=3032; Support=28; Lift=3.06; p=1.99E-007] 237 & 4685 → 1 
 [Coverage=19; Support=9; Lift=157.00; p=5.03E-012] 1159 → 1 
 [Coverage=197; Support=9; Lift=15.14; p=1.13E-008] 4685 → 1 
 [Coverage=270; Support=9; Lift=11.05; p=1.68E-007] 168 → 1 
 [Coverage=293; Support=9; Lift=10.18; p=3.33E-007] 4382 → 1 
 [Coverage=72; Support=8; Lift=36.83; p=6.26E-011] 168 & 4685 → 1 
 [Coverage=9; Support=7; Lift=257.78; p=6.66E-011]
  • 10. BigML, Inc 10Unsupervised Learning Magnum Opus Very high support patterns can be spurious Data file: covtype.data 581012 cases / 125 values ST15=0 → ST07=0 
 [Coverage=581009; Support=580904; Confidence=1.000] ST07=0 → ST15=0 
 [Coverage=580907; Support=580904; Confidence=1.000] ST15=0 → ST36=0 
 [Coverage=581009; Support=580890; Confidence=1.000] ST36=0 → ST15=0 
 [Coverage=580893; Support=580890; Confidence=1.000] ST15=0 → ST08=0 
 [Coverage=581009; Support=580830; Confidence=1.000] ST08=0 → ST15=0 
 [Coverage=580833; Support=580830; Confidence=1.000] … 197,183,686 such rules have highest support
  • 11. BigML, Inc 11Unsupervised Learning Magnum Opus • User selects measure of interest • System finds the top-k associations on that measure within constraints • Must be statistically significant interaction between antecedent and consequent • Every item in the antecedent must increase the strength of association
  • 12. BigML, Inc 12Unsupervised Learning Association Metrics Coverage Percentage of instances which match antecedent “A” Instances A C
  • 13. BigML, Inc 13Unsupervised Learning Association Metrics Support Percentage of instances which match antecedent “A” and Consequent “C” Instances A C
  • 14. BigML, Inc 14Unsupervised Learning Association Metrics Confidence Percentage of instances in the antecedent which also contain the consequent. Coverage Support Instances A C
  • 15. BigML, Inc 15Unsupervised Learning Association Metrics C Instances A C A Instances C Instances A Instances A C 0% 100% Instances A C Confidence A never implies C A sometimes implies C A always implies C
  • 16. BigML, Inc 16Unsupervised Learning Association Metrics Lift Ratio of observed support to support if A and C were statistically independent. Support == Confidence p(A) * p(C) p(C) Independent A C C Observed A
  • 17. BigML, Inc 17Unsupervised Learning Association Metrics C Observed A Observed A C < 1 > 1 Independent A C Lift = 1 Negative Correlation No Association Positive Correlation Independent A C Independent A C Observed A C
  • 18. Association Metrics: Leverage
    Difference of observed support and support if A and C were statistically independent.
    Leverage = Support - [ p(A) * p(C) ]
  • 19. Association Metrics: Leverage
    Leverage < 0: Negative Correlation
    Leverage = 0: No Association
    Leverage > 0: Positive Correlation
    -1…
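Leverage can be computed from the same counts as lift. The sketch below covers the three cases on the slide: the {class = gas} → amount < 100 rule from the earlier transaction table (positive), plus two sets of hypothetical counts for the independent and negative cases. The function name is illustrative only.

```python
def leverage(n, n_a, n_c, n_ac):
    """Leverage = Support - p(A) * p(C), computed from counts."""
    return n_ac / n - (n_a / n) * (n_c / n)

positive = leverage(n=7, n_a=2, n_c=3, n_ac=2)         # gas rule from the table
independent = leverage(n=100, n_a=50, n_c=40, n_ac=20)  # hypothetical counts
negative = leverage(n=100, n_a=50, n_c=40, n_ac=10)     # hypothetical counts

print(round(positive, 3), round(independent, 3), round(negative, 3))
```

Unlike lift, leverage is a difference rather than a ratio, so it stays small for rules that cover few instances even when the relative increase is large.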
  • 20. Use Cases
    GOAL: Discover “interesting” rules about what store items are typically purchased together.
    • Dataset of 9,834 grocery cart transactions
    • Each row is a list of all items in a cart at checkout
  • 21. Association Discovery
 Demo #1
  • 22. Use Cases
    GOAL: Find general rules that indicate diabetes.
    • Dataset of diagnostic measurements of 768 patients.
    • Each patient labelled True/False for diabetes.
  • 23. Association Discovery
 Demo #2
  • 24. Medical Risks
    Decision Tree: If plasma glucose > 155 and bmi > 29.32 and diabetes pedigree > 0.32 and insulin <= 629 and age <= 44 then diabetes = TRUE
    Association Rule: If plasma glucose > 146 then diabetes = TRUE
  • 25. Latent Dirichlet Allocation #VSSML16 September 2016 #VSSML16 Latent Dirichlet Allocation September 2016 1 / 24
  • 26. Outline
    1. Understanding the Limits of Simple Text Analysis
    2. Aside: Generative Processes
    3. Latent Dirichlet Allocation
    4. A Couple of Instructive Examples
    5. Applications
  • 28. Bag of Words Analysis
    • Easiest way of analyzing a text field is just to treat it as a “bag of words”
    • Each word is a separate feature (usually an occurrence count)
    • When modeling, the features are treated in isolation from one another, essentially “one at a time”
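A minimal sketch of a bag-of-words representation using only the Python standard library. The tokenizer (lowercase, letters and apostrophes only) and the two example sentences are assumptions made for illustration; real pipelines usually also handle stop words, stemming, and so on.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a text to {word: occurrence count}, ignoring order and context."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

docs = [
    "The CEO will lead the new team.",
    "Lead is a heavy metal element.",
]
bags = [bag_of_words(d) for d in docs]

for bag in bags:
    print(dict(bag))
```

Both documents end up with the feature `lead` equal to 1, even though the word means something different in each; the counts alone carry no context.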
  • 29. Limitations
    • Words are sometimes ambiguous
    • Both because of multiple definitions and difference in tone
    • How do we usually disambiguate words? Context
  • 30. An Instructive Example
    • One way of looking at the usefulness of a machine learning feature is to think about how well it isolates unique and coherent subsets of the data
    • Suppose I have a collection of documents where some of them are about two different topics (via Ted Underwood’s Blog):
        - Leadership (CEOs, organization, management)
        - Chemistry (Elements, compounds, reactions)
    • If I do a keyword search for “lead” (or try to classify documents based on that word alone), I’ll get documents from either category and documents that are a mix of both
    • Can we build a feature that better isolates which set of documents we’re looking for?
  • 32. Generative Modeling
    • Posit a parameterized structure that is responsible for generating the data
    • Use the data to fit the parameters
    • A notion of causality is important for these models
  • 33. Example of a Generative Model
    • Consider a patient with some disease
    • Class: Disease present / absent, Features: Test results
    • Arrows indicate cause in this diagram; the symptoms (features) are caused by the disease
    • This generative process implies a structure; in this case the so-called “Naive Bayes” model
  • 34. Generative vs. Discriminative
    • This is an important distinction in machine learning generally
    • Generative models try to model / assume a structure for the process generating the data
    • More mathematically, generative classifiers explicitly model the joint distribution p(x, y) of the data
    • Discriminative models don’t care; they “solve the prediction problem directly”, and model only the conditional p(y|x) (Vapnik)
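The difference between the two quantities can be shown with a toy joint distribution. The sketch below assumes a single binary test result x and a disease label y; the probabilities are invented for illustration. A generative model holds the whole table, from which p(y|x) and p(x) both follow, whereas a discriminative model would only ever represent the conditional.

```python
# Hypothetical joint distribution p(x, y): x is a test result, y the true state.
joint = {
    ("positive", "disease"): 0.08,
    ("positive", "healthy"): 0.10,
    ("negative", "disease"): 0.02,
    ("negative", "healthy"): 0.80,
}

def marginal_x(joint, x):
    """p(x): sum the joint over all labels y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional(joint, x):
    """p(y | x) = p(x, y) / p(x)."""
    p_x = marginal_x(joint, x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

posterior = conditional(joint, "positive")
print(posterior)                      # what a discriminative model estimates directly
print(marginal_x(joint, "positive"))  # only available from the joint
```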
  • 35. Which is Better?
    • No general answer to this question (not that we haven’t tried): Paper: On Discriminative vs. Generative Classifiers [1]
    • Discriminative models tend to be faster to fit, quicker to predict, and in the case of non-parametrics are often guaranteed to converge to the correct answer given enough data
    • Generative models tend to be more probabilistically sound and able to do more than just classify
    [1] http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
  • 37. A New Way of Thinking About Documents
    • Three entities: Documents, Terms, and Topics
    • A term is a single lexical token (usually one or more words, but can be any arbitrary string)
    • A document has many terms
    • A topic is a distribution over terms
  • 38. A Generative Model for Documents
    • A document can be thought of as a distribution over topics, drawn from a distribution over possible distributions
    • To create a document, repeatedly draw a topic at random from the document’s topic distribution, then draw a term from that topic (which, remember, is a distribution over terms)
    • The main thing we want to infer is the topic distribution
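The generative story above can be written out in a few lines. This is a sketch under stated assumptions: the two topics, their term probabilities, and the Dirichlet parameter `alpha` are invented for illustration, and the Dirichlet draw is built from normalized gamma samples using the standard library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Draw a topic distribution for the document, then for each position
    draw a topic from it and a term from that topic."""
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=theta)[0]
        terms, probs = zip(*topics[k].items())
        words.append(rng.choices(terms, weights=probs)[0])
    return theta, words

# Hypothetical topics: each one is a distribution over terms.
topics = [
    {"ceo": 0.4, "manage": 0.3, "lead": 0.3},        # leadership
    {"element": 0.4, "compound": 0.3, "lead": 0.3},  # chemistry
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the words, infer the topics and each document's `theta`.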
  • 39. Dirichlet Process Intuition: Rich Get Richer
    • We use a Dirichlet process to model the relationship between documents, topics, and terms
    • We’re more likely to think a word came from a topic if we’ve already seen a bunch of words from that topic
    • We’re more likely to think the topic was responsible for generating the document if we’ve already seen a bunch of words in the document from that topic
    • Here lies the disambiguation: If a word could have come from two different topics, we use the rest of the words in the document to decide which meaning it has
    • Note that there’s a little bit of self-fulfilling prophecy going on here (by design)
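One common way to make this intuition concrete is the per-word update used in collapsed Gibbs sampling for LDA, where the weight of topic k for a word is proportional to (words in this document already assigned to k + alpha) times (times this word was assigned to k + beta) / (total words assigned to k + V * beta). The slides do not say which inference method is used, so treat this as an assumed illustration; all counts and the smoothing values `alpha` and `beta` are hypothetical.

```python
def topic_probabilities(doc_topic_counts, topic_word_counts, topic_totals,
                        word, vocab_size, alpha=0.1, beta=0.01):
    """Probability that `word` in this document belongs to each topic,
    following the standard collapsed Gibbs update for LDA."""
    weights = []
    for k, n_dk in enumerate(doc_topic_counts):
        n_kw = topic_word_counts[k].get(word, 0)
        weights.append((n_dk + alpha) * (n_kw + beta)
                       / (topic_totals[k] + vocab_size * beta))
    total = sum(weights)
    return [w / total for w in weights]

# "lead" is equally common in both topics across the corpus...
topic_word_counts = [{"lead": 50}, {"lead": 50}]  # 0 = leadership, 1 = chemistry
topic_totals = [1000, 1000]
# ...but this document already has 9 chemistry words and 1 leadership word.
probs = topic_probabilities([1, 9], topic_word_counts, topic_totals,
                            word="lead", vocab_size=500)
print(probs)
```

Because the word-level evidence is tied, the document's existing topic counts decide the outcome, which is the "rich get richer" effect described on the slide.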
  • 41. Usenet Movie Reviews
    Library of over 26,000 movie reviews
    A solid noir melodrama from Vincent Sherman, who takes a standard story and dresses it up with moving characterizations and beautifully expressionistic B&W photography from cinematographer James Wong Howe. The director took a songwriter Paul Webster's short magazine story called "The Man Who Died Twice" and improved the story by rounding out the characters to give them both strong and weak points, so that they would not be one-note characters as was the case in the original story. The film was made by Warner Brothers, who needed a film for their contract star Ann Sheridan and asked Sherman to change the story around so that her part as Nora Prentiss, a nightclub singer, is expanded
  • 42. Supreme Court Cases
    Library of about 7500 Supreme Court Cases
    NO. 136. ARGUED DECEMBER 6, 1966. - DECIDED JANUARY 9, 1967. - 258 F. SUPP. 819, REVERSED. FOLLOWING THIS COURT'S DECISIONS IN SWANN V. ADAMS, INVALIDATING THE APPORTIONMENT OF THE FLORIDA LEGISLATURE (378 U.S. 553) AND THE SUBSEQUENT REAPPORTIONMENT WHICH THE DISTRICT COURT HAD FOUND UNCONSTITUTIONAL BUT APPROVED ON AN INTERIM BASIS (383 U.S. 210), THE FLORIDA LEGISLATURE ADOPTED STILL ANOTHER LEGISLATIVE REAPPORTIONMENT PLAN, WHICH APPELLANTS, RESIDENTS AND VOTERS OF DADE COUNTY, FLORIDA, ATTACKED AS FAILING TO MEET THE STANDARDS OF VOTER EQUALITY SET FORTH
  • 44. Visualizing Changes in Topic Over Time
    • Plot changes in topic distribution over time
    • Especially nice for dated historical collections (e.g., novels, newspapers)
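One simple way to get a plottable series is to average the per-document topic distributions within each time bucket. The sketch below assumes each document comes with a year and a topic vector already inferred by LDA; the years and vectors shown are hypothetical.

```python
from collections import defaultdict

def topic_trend(dated_docs):
    """Average topic distribution per year from (year, topic_vector) pairs."""
    sums = {}
    counts = defaultdict(int)
    for year, theta in dated_docs:
        if year not in sums:
            sums[year] = [0.0] * len(theta)
        sums[year] = [s + t for s, t in zip(sums[year], theta)]
        counts[year] += 1
    return {year: [s / counts[year] for s in sums[year]] for year in sorted(sums)}

# Hypothetical documents: (publication year, [topic 0 share, topic 1 share]).
dated_docs = [
    (1900, [0.8, 0.2]),
    (1900, [0.6, 0.4]),
    (1950, [0.3, 0.7]),
]
trend = topic_trend(dated_docs)
print(trend)
```

Each column of the result can then be drawn as one line on a chart with the year on the horizontal axis.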
  • 45. Search Without Keywords
    • Keyword search is great, if you know the keywords
    • Good for finding search terms
    • Great for, e.g., legal discovery
    • Nice for finding “outliers”
    • Surprise topics (From the recycle bin)
  • 46. Feature Spaces for Classification
    • Just classify the documents in “topic space” rather than “bag space”
    • The topics that come out of LDA have some nice benefits as features
        - Can reduce a feature space of thousands to a few dozen (faster to fit)
        - Nicely interpretable
        - Automatically tailored to the documents you’ve provided
    • Foreshadowing Alert: When using LDA in this way, we’re doing a form of feature engineering which we’ll hear more about tomorrow.
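To illustrate classifying in "topic space", the sketch below uses each document's topic proportions as its entire feature vector. The classifier is a nearest-centroid rule chosen here only because it is short; the slides do not prescribe a particular model, and the labels and topic vectors are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nearest_label(theta, centroids):
    """Label whose centroid is closest to theta in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(theta, centroids[label]))

# Hypothetical training documents, already mapped to two topic proportions each.
training = {
    "business": [[0.9, 0.1], [0.8, 0.2]],
    "science": [[0.2, 0.8], [0.1, 0.9]],
}
centroids = {label: centroid(vectors) for label, vectors in training.items()}

prediction = nearest_label([0.7, 0.3], centroids)
print(prediction)
```

With a real corpus the vectors would have a few dozen entries (one per topic) instead of thousands of word counts, which is where the faster fitting comes from.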
  • 47. Some Caveats
    • You need to choose the number of topics beforehand
    • Takes forever, both to fit and to do inference
    • Takes a lot of text to make it meaningful
    • Tends to focus on “meaningless minutiae”
    • While it sometimes makes a nice classification space, it’s a rare case that provides dramatic improvement over bag-of-words
    • I find it nice just for exploration
  • 48. Thus Ends The Lesson
    Questions?