SlideShare ist ein Scribd-Unternehmen logo
1 von 25
February 8, 2022
1. Why clustering?
2. Example 1, hierarchical clustering
3. Questions
4. Overview of common algorithms and strengths/limitations
5. Example 2, K-means clustering
6. Questions
2
3
4
Reverse looking
What are characteristics of
students who used/did not use
a service? Did/did not persist?
Who is the prototypical
student in each academic
program?
Forward looking
What characteristics comprise
our prospective student
personas?
How do risk factors overlap in
some groups of students?
5
The challenge
Contrary to national trends, Dallas College has experienced larger
than typical drops in female student re-enrollment patterns. Our
goal was to stem decreases for the Fall 2021 semester through a
text campaign.
The question
What messaging should be used to resonate with these students?
6
45k students
Diverse features for the model
• Demographics: race/ethnicity, gender, age
• Financial: income, employment intensity
• Household: household size, dependent count
• Academics: last term of enrollment, credits, GPA
• Special Pop membership
7
8
Step 1
Starts with a case as leaf node
Each case goes into that leaf
node or breaks into a new
node
Result is a Cluster Features
tree
Step 2
Leaf nodes are combined
through agglomeration
Result is several “best”
clusters with a silhouette score
9
10
African Am.
3 dependents
Part-time job
Hispanic
4 dependents
Part-time job
Hispanic
2 dependents
Full-time job
White
1 dependent
Full-time job
11
BIRCH and Two-step (SPSS proprietary; Ex 1)
K-means, K-modes, K-medoids (Ex 2)
Hierarchical (Agglomerative and Divisive)
12
13
14
15
Clusters based on
distance cutoff line
Distance cutoff line
Algorithm Variable Types
BIRCH / 2-Step Any, multiple at the same time
Hierarchical Any, but stick to one kind at a time and use
well-paired distance measure
K-means Continuous
K-modes Categorical
K-medoids Any, multiple at the same time with the
right distance measure (Gower)
16
17
Algorithm +/-
BIRCH / 2-Step Flexible variable types, fast compute, large
data; there is an element of a black box
Hierarchical Flexible modeling, highly explainable;
limited to small data
K-Means, K-Modes Easy, fast, large data; sensitive to K, curse
of dimensionality, clusters same sized
K-medoids Like K-Means but more flexible variable
types and more costly tuning & compute
18
• Population: 24 years and older adults enrolled for Fall 2021
• Features : Academics, Demographic, Financial , Household and Veteran status.
• Mix of discrete and continuous variables with majority of them being categorical data.
19
Below Federal poverty
Level
Below
Median
Lower half Above
Median
Upper half Above
Median
Median household income in Texas
Poverty Flag INCOME BIN
20
V1 V2 V3 V4 V5
0 1 1 0 1
V1 V2 V3 V4 V5
1 0 0 0 1
1+1+1+1 = 4
Dissimilarity measure
0: Mismatch 1: Match
• Randomly select the K initial centers
• Repeat
1. Assign the samples to nearest center
2. Update means/modes based on newly formed cluster
3. Calculate the cost (SSE/Sum of dissimilarity)
• Stop when cluster centers converges
Using frequency-based
method to calculate mode
instead of the mean of the
sample
Minimize the cost function
Sample 1
Sample 2
21
Student Personas
22
Missing values for categorical variables
such as employment status
Missing values for
continuous variables
such as income
Possible solutions
• Revisit our data warehouse to obtain as much
information as we can to fill in the missing values
• Imputation with K-Nearest neighbors with
Hamming distance for categorical data
• Imputation with K-Nearest neighbors with
Euclidean distance for continuous data
Jeremy Anderson, Associate Vice Chancellor of Strategic
Analytics, jeremy.anderson@dcccd.edu
Dillon Lu, Data Analyst, DLu@dcccd.edu

Weitere ähnliche Inhalte

Ähnlich wie Clustering Models to Assist in Student Outreach

Proposal defense apr19
Proposal defense apr19Proposal defense apr19
Proposal defense apr19
rgeurtz
 
Local Critical Issue Version 2
Local Critical Issue Version 2Local Critical Issue Version 2
Local Critical Issue Version 2
Tim Hasse
 
2.why the common_core_presentation_with_facilitators_notes_update_072213
2.why the common_core_presentation_with_facilitators_notes_update_0722132.why the common_core_presentation_with_facilitators_notes_update_072213
2.why the common_core_presentation_with_facilitators_notes_update_072213
WRHSlibrary
 
Localcriticalissueversion2 091109081645 Phpapp01
Localcriticalissueversion2 091109081645 Phpapp01Localcriticalissueversion2 091109081645 Phpapp01
Localcriticalissueversion2 091109081645 Phpapp01
Tim Hasse
 

Ähnlich wie Clustering Models to Assist in Student Outreach (20)

Proposal defense apr19
Proposal defense apr19Proposal defense apr19
Proposal defense apr19
 
Using Data to Mobilize Commuities and Change Lives
Using Data to Mobilize Commuities and Change LivesUsing Data to Mobilize Commuities and Change Lives
Using Data to Mobilize Commuities and Change Lives
 
CB FORUM FINAL
CB FORUM FINALCB FORUM FINAL
CB FORUM FINAL
 
Margaret Patton, Dissertation Defense, Dr. William Allan Kritsonis, PhD Disse...
Margaret Patton, Dissertation Defense, Dr. William Allan Kritsonis, PhD Disse...Margaret Patton, Dissertation Defense, Dr. William Allan Kritsonis, PhD Disse...
Margaret Patton, Dissertation Defense, Dr. William Allan Kritsonis, PhD Disse...
 
Dr. Margaret Curette Patton, PhD Dissertation, Dr. William Allan Kritsonis, D...
Dr. Margaret Curette Patton, PhD Dissertation, Dr. William Allan Kritsonis, D...Dr. Margaret Curette Patton, PhD Dissertation, Dr. William Allan Kritsonis, D...
Dr. Margaret Curette Patton, PhD Dissertation, Dr. William Allan Kritsonis, D...
 
Transitioning to Common Core: What it means, What to look for
Transitioning to Common Core: What it means, What to look forTransitioning to Common Core: What it means, What to look for
Transitioning to Common Core: What it means, What to look for
 
Local Critical Issue Version 2
Local Critical Issue Version 2Local Critical Issue Version 2
Local Critical Issue Version 2
 
Union Col lecture
Union Col lectureUnion Col lecture
Union Col lecture
 
Rabbani - Education
Rabbani - EducationRabbani - Education
Rabbani - Education
 
Getting to the Root Causes of Disproportionate Representation in Special Educ...
Getting to the Root Causes of Disproportionate Representation in Special Educ...Getting to the Root Causes of Disproportionate Representation in Special Educ...
Getting to the Root Causes of Disproportionate Representation in Special Educ...
 
Connecticut Core 2014
Connecticut Core 2014Connecticut Core 2014
Connecticut Core 2014
 
2.why the common_core_presentation_with_facilitators_notes_update_072213
2.why the common_core_presentation_with_facilitators_notes_update_0722132.why the common_core_presentation_with_facilitators_notes_update_072213
2.why the common_core_presentation_with_facilitators_notes_update_072213
 
New Canaan BOE
New Canaan BOENew Canaan BOE
New Canaan BOE
 
SACAC Session E.12 Sealing the Deal
SACAC Session E.12 Sealing the DealSACAC Session E.12 Sealing the Deal
SACAC Session E.12 Sealing the Deal
 
Localcriticalissueversion2 091109081645 Phpapp01
Localcriticalissueversion2 091109081645 Phpapp01Localcriticalissueversion2 091109081645 Phpapp01
Localcriticalissueversion2 091109081645 Phpapp01
 
Chautauqua County School Board Dinner
Chautauqua County School Board DinnerChautauqua County School Board Dinner
Chautauqua County School Board Dinner
 
Funding Dries Up For Non Profit And Educational Institutions Serving Black Co...
Funding Dries Up For Non Profit And Educational Institutions Serving Black Co...Funding Dries Up For Non Profit And Educational Institutions Serving Black Co...
Funding Dries Up For Non Profit And Educational Institutions Serving Black Co...
 
Credit risk predictive analytics
Credit risk predictive analytics Credit risk predictive analytics
Credit risk predictive analytics
 
Missouri ACT Identified Keys to Enrollment Success
Missouri ACT Identified Keys to Enrollment SuccessMissouri ACT Identified Keys to Enrollment Success
Missouri ACT Identified Keys to Enrollment Success
 
Top 11 Metrics Every Financial Aid Director Should Be Measuring
Top 11 Metrics Every Financial Aid Director Should Be MeasuringTop 11 Metrics Every Financial Aid Director Should Be Measuring
Top 11 Metrics Every Financial Aid Director Should Be Measuring
 

Mehr von Jeremy Anderson

Mehr von Jeremy Anderson (20)

Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
How I Lead and Manage
How I Lead and ManageHow I Lead and Manage
How I Lead and Manage
 
Creating a Print-on-Demand Initiative for Open Educational Resources
Creating a Print-on-Demand Initiative for Open Educational ResourcesCreating a Print-on-Demand Initiative for Open Educational Resources
Creating a Print-on-Demand Initiative for Open Educational Resources
 
How any institution can get started on learning analytics
How any institution can get started on learning analyticsHow any institution can get started on learning analytics
How any institution can get started on learning analytics
 
Creating Wraparound Supports for Students through Internal Partnerships
Creating Wraparound Supports for Students through Internal PartnershipsCreating Wraparound Supports for Students through Internal Partnerships
Creating Wraparound Supports for Students through Internal Partnerships
 
Four LMS Tools to Change Your Life
Four LMS Tools to Change Your LifeFour LMS Tools to Change Your Life
Four LMS Tools to Change Your Life
 
Addressing the Adjunct Underclass: Fit and Employment Outcomes in Part-Time F...
Addressing the Adjunct Underclass: Fit and Employment Outcomes in Part-Time F...Addressing the Adjunct Underclass: Fit and Employment Outcomes in Part-Time F...
Addressing the Adjunct Underclass: Fit and Employment Outcomes in Part-Time F...
 
Plan an OER Stragegic Plan: Change Models and Processes
Plan an OER Stragegic Plan: Change Models and ProcessesPlan an OER Stragegic Plan: Change Models and Processes
Plan an OER Stragegic Plan: Change Models and Processes
 
Sustaining Accessible OER at Scale
Sustaining Accessible OER at ScaleSustaining Accessible OER at Scale
Sustaining Accessible OER at Scale
 
Hybrid Course Design - Will It Blend?!
Hybrid Course Design - Will It Blend?!Hybrid Course Design - Will It Blend?!
Hybrid Course Design - Will It Blend?!
 
The Triple A (AAA) of OER: Accessibility, Availability, and Affordability
The Triple A (AAA) of OER: Accessibility, Availability, and AffordabilityThe Triple A (AAA) of OER: Accessibility, Availability, and Affordability
The Triple A (AAA) of OER: Accessibility, Availability, and Affordability
 
Case Study: Increasing Access through OER Adoption
Case Study: Increasing Access through OER AdoptionCase Study: Increasing Access through OER Adoption
Case Study: Increasing Access through OER Adoption
 
What data can deliver: A new way of operating
What data can deliver: A new way of operatingWhat data can deliver: A new way of operating
What data can deliver: A new way of operating
 
2018 Horizon Report Webinar: Adaptive Learning and OER at Scale
2018 Horizon Report Webinar: Adaptive Learning and OER at Scale2018 Horizon Report Webinar: Adaptive Learning and OER at Scale
2018 Horizon Report Webinar: Adaptive Learning and OER at Scale
 
The Nitty Gritty of OER Adoption
The Nitty Gritty of OER AdoptionThe Nitty Gritty of OER Adoption
The Nitty Gritty of OER Adoption
 
The Path to Creating an Integrated Online Contingent Faculty Competency System
The Path to Creating an Integrated Online Contingent Faculty Competency SystemThe Path to Creating an Integrated Online Contingent Faculty Competency System
The Path to Creating an Integrated Online Contingent Faculty Competency System
 
Strategies to Scale Adaptive Course Design
Strategies to Scale Adaptive Course DesignStrategies to Scale Adaptive Course Design
Strategies to Scale Adaptive Course Design
 
OER in the Time of Adaptive
OER in the Time of AdaptiveOER in the Time of Adaptive
OER in the Time of Adaptive
 
Implementing Adaptive, Data-Driven Course Design to Improve Student Learning
Implementing Adaptive, Data-Driven Course Design to Improve Student LearningImplementing Adaptive, Data-Driven Course Design to Improve Student Learning
Implementing Adaptive, Data-Driven Course Design to Improve Student Learning
 
Lessons from Adopting an Adaptive Learning Platform
Lessons from Adopting an Adaptive Learning PlatformLessons from Adopting an Adaptive Learning Platform
Lessons from Adopting an Adaptive Learning Platform
 

Kürzlich hochgeladen

Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Valters Lauzums
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
cyebo
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
RafigAliyev2
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
pyhepag
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
cyebo
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
pyhepag
 

Kürzlich hochgeladen (20)

How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prison
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 

Clustering Models to Assist in Student Outreach

  • 2. 1. Why clustering? 2. Example 1, hierarchical clustering 3. Questions 4. Overview of common algorithms and strengths/limitations 5. Example 2, K-means clustering 6. Questions 2
  • 3. 3
  • 4. 4 Reverse looking What are characteristics of students who used/did not use a service? Did/did not persist? Who is the prototypical student in each academic program? Forward looking What characteristics comprise our prospective student personas? How do risk factors overlap in some groups of students?
  • 5. 5
  • 6. The challenge Contrary to national trends, Dallas College has experienced larger than typical drops in female student re-enrollment patterns. Our goal was to stem decreases for the Fall 2021 semester through a text campaign. The question What messaging should be used to resonate with these students? 6
  • 7. 45k students Diverse features for the model • Demographics: race/ethnicity, gender, age • Financial: income, employment intensity • Household: household size, dependent count • Academics: last term of enrollment, credits, GPA • Special Pop membership 7
  • 8. 8 Step 1 Starts with a case as leaf node Each case goes into that leaf node or breaks into a new node Result is a Cluster Features tree Step 2 Leaf nodes are combined through agglomeration Result is several “best” clusters with a silhouette score
  • 9. 9
  • 10. 10 African Am. 3 dependents Part-time job Hispanic 4 dependents Part-time job Hispanic 2 dependents Full-time job White 1 dependent Full-time job
  • 11. 11
  • 12. BIRCH and Two-step (SPSS proprietary; Ex 1) K-means, K-modes, K-medoids (Ex 2) Hierarchical (Agglomerative and Divisive) 12
  • 13. 13
  • 14. 14
  • 15. 15 Clusters based on distance cutoff line Distance cutoff line
  • 16. Algorithm Variable Types BIRCH / 2-Step Any, multiple at the same time Hierarchical Any, but stick to one kind at a time and use well-paired distance measure K-means Continuous K-modes Categorical K-medoids Any, multiple at the same time with the right distance measure (Gower) 16
  • 17. 17 Algorithm +/- BIRCH / 2-Step Flexible variable types, fast compute, large data; there is an element of a black box Hierarchical Flexible modeling, highly explainable; limited to small data K-Means, K-Modes Easy, fast, large data; sensitive to K, curse of dimensionality, clusters same sized K-medoids Like K-Means but more flexible variable types and more costly tuning & compute
  • 18. 18
  • 19. • Population: 24 years and older adults enrolled for Fall 2021 • Features : Academics, Demographic, Financial , Household and Veteran status. • Mix of discrete and continuous variables with majority of them being categorical data. 19 Below Federal poverty Level Below Median Lower half Above Median Upper half Above Median Median household income in Texas Poverty Flag INCOME BIN
  • 20. 20 V1 V2 V3 V4 V5 0 1 1 0 1 V1 V2 V3 V4 V5 1 0 0 0 1 1+1+1+1 = 4 Dissimilarity measure 0: Mismatch 1: Match • Randomly select the K initial centers • Repeat 1. Assign the samples to nearest center 2. Update means/modes based on newly formed cluster 3. Calculate the cost (SSE/Sum of dissimilarity) • Stop when cluster centers converges Using frequency-based method to calculate mode instead of the mean of the sample Minimize the cost function Sample 1 Sample 2
  • 22. 22 Missing values for categorical variables such as employment status Missing values for continuous variables such as income Possible solutions • Revisit our data warehouse to obtain as much information as we can to fill in the missing values • Imputation with K-Nearest neighbors with Hamming distance for categorical data • Imputation with K-Nearest neighbors with Euclidean distance for continuous data
  • 23.
  • 24.
  • 25. Jeremy Anderson, Associate Vice Chancellor of Strategic Analytics, jeremy.anderson@dcccd.edu Dillon Lu, Data Analyst, DLu@dcccd.edu

Hinweis der Redaktion

  1. REFER to recent Dallas Morning News Article and its Framing Give brief overview – focused on 2 components – Data and Demographics by Core Populations AND Interventions to help support all students.
  2. You have a population in n dimensions of space Your model breaks that into subpopulations that are self-similar and distinct from other subpopulations There are many algorithms that attack this basic task in different ways
  3. The idea with all of these kinds of research question is that you would take a very large population and break it down into smaller groups. Then, you customize the messaging to fit the cluster personas. Used a lot in the marketing world to create customer personas, but has applications to any kind of “product” like academic programs and student services.
  4. The case count was fairly high. 45k+ female students had enrolled in prior three terms, had not attained a credential or transferred, and were not enrolled for the Fall. Some variables like income and household size are highly missing if we rely just on FAFSA because only about half of our students complete it. Instead, we supplement by using a homegrown Student Information Profile that goes out 1x a year and asks those and other questions. Part of data preparation is merging the two data streams, which also is a challenge because they are binned as categorical in the SIP but are continuous in FAFSA. The other variables are highly available and mostly clean, so very little cleanup was necessary. Other than that, we do some regular feature engineering around membership in special populations like athletes, international students, foster students, veterans, etc. That's code that we've written and reuse across projects to save some time. After the data prep, I stepped back and saw that the features for consideration were all mixed. Some were binary and others categorical or continuous. For that reason, and for the size of the data, I chose the two-step clustering algorithm in SPSS because it works well under both conditions and because, as a social scientist by training, I am most familiar with SPSS as an analysis tool. With the data ready, I then played around with the features to maximize the evaluation score I was seeing. Loaded them all in at once, then looked at the cluster overlap charts which is a feature in SPSS. Took out ones that didn’t have strong separation and was left with a collection of features that would be most useful/impactful for final modeling.
  5. The tool I chose to use was SPSS because that's my bread and butter as a social scientist by trade. It has a couple of built-in options for clustering. In step 1, the similarity score, based on the distance between the case and the node values, is the determining factor. If the distance is above the algorithm's threshold and the node would spread to wide to incorporate it, then the data point goes into a newly created node.  In step 2, the nodes from the tree start to get clumped together, also based on their distances. Lots of different cluster counts are tried and evaluated automatically with one of the available clustering criteria that you can choose from in the model settings in SPSS Ultimately, you get a set of the best clusters for the data. The model picks the number of clusters automatically through the Cluster Tree creation and the evaluation of the agglomerations.
  6. I got my silhouette measure that gave me a sense of how well the points in each cluster were self-similar and how much each cluster was different from the others. Score was ranging from 0.6 to 0.7, which falls in the good range Was able to click into the scoring for details. That’s the clusters matrix that shows the clusters as columns and the features used to build the clusters as rows Can click on one of the cells to see that feature’s overall distribution and what’s present in the cluster in the Cluster Distribution visual Can click multiple clusters to see how they compare in a visual way in the Cluster Comparison
  7. Hispanic, Black, or White, with 3-4 dependents, employed part-time Hispanic, 3 dependents, employed full-time Black or White, 1-2 dependents, employed full-time Overriding message was POC with more dependents, especially if employed less than full-time. The hypothesis there was that we should focus messaging on emergency aid Whereas full-time workers with fewer dependents might benefit from an awareness campaign about our childcare assistance. Really, though, we needed to test these assumptions. Used these clusters to then create a call list of ~350 students with the aim of talking to 50 (10-15 per cluster). The phone interviews were short, about 10 minutes, and guided by a set of template questions that our Student Success Research team and our student success staff worked on together. Results from the qualitative discussions left us with a collection of themes for each cluster and some prospective talking points to use in the text campaign. Ended up with 5600 students responding to the text campaign and re-enrolling, equating to an 18% response
  8. Before we take a quick tour of some other clustering algorithms and their strengths and weaknesses, let's pause for one or two comments or questions about our first example. We will also have some time at the end for other questions.
  9. Today, we're looking at three very common categories of clustering algorithms, though there are others that are applicable to other types of challenges. Already covered two-step, which is similar to BIRCH, in example 1. The point there is that it blends the approach of the other types of algorithms listed to get the best of both worlds. For comparison, then, we’ll look at these other two common approaches in the K-algorithms and hierarchical algorithm.
  10. The premise of all three of these is you pick a number of K (clusters). Rather than pick at random, two good ways to narrow in on the best K for the data are:  elbow method – for this it would be somewhere between 6 and 8 that you might want to try Silhouette Coefficient The cluster picking methods and the algorithms in general are available in Python's Scikit-learn package and they're in R, but it'll take a few packages
  11. So, how does it work, generally? K-means places points as the center of the clusters, whereas K-medoids chooses the most representative case. K-medoids is less sensitive to outliers, like using the median rather than the mean, so it may be preferable to K-means depending on your data K-mode, meanwhile, looks for the number of similarities between data points when considering the different features, but it still collapses all of this into a pre-determined number of clusters that you set.
  12. Match up pairs of cases based on distance between them Then match up pairs of pairs, also based on distance Pick what distance you’re okay with and use that to determine the clusters. That's where it's different from Two-Step and BIRCH which will do that automatically. Divisive flips this on its head and starts with all of the variables in one Available in Python scikit-learn and R
  13. Divide our student body into groups so we can look at each specific group more closely and identify what they need and what can we do to help them reach their educational goals. 
  14. K-mode clustering   Determine initial number of clusters K Minimize cost function while maintain relatively small value for K.  Purpose of study  is that we want to divide our student body into large groups so we can study each group more effectively therefore to have a better understanding of our student body as a whole. We prefer generalization rather than looking at specific cases. Choose 4 in this case and move on to the next page  
  15. This information can help us to develop strategic plans that focus on a particular group of student to help them to achieve their goals and be successful here at our institution. 
  16. Many students with missing income end up in the second income bin due to nature of our dataset design  Revisit our data and eliminate as many missing value as possible to obtain a more complete data should help greatly to have our students more evenly distributed in INCOME_BIN variable