© 2019 Iver Band
Chronic Absenteeism Rate Prediction (CARP)
Coursera Advanced Data Science with IBM Specialization
Capstone Project
Iver Band
December 2019
Agenda
● Use Case
● Source Data Sets
● Architectural Overview
● Data Quality Assessment
● Data Visualization
● Data Exploration
● Initial Feature Engineering
● Model Performance Indicator
● Core Concepts
● Algorithms
● Frameworks
● Feature Engineering Experiment
● Model Performance Evaluation
● Conclusion
Chronic Absenteeism Rate Prediction (CARP)
Use Case
● Chronic absenteeism occurs when a student in grades K-12 misses 10% or more of the school year for any reason
● It is a strong predictor of low academic achievement
● CARP is for data scientists supporting educational and social services
administrators and policymakers in answering the following questions:
○ What demographic factors predict the rate of chronic absenteeism?
○ Can we use demographic data to predict chronic absenteeism rates?
Source Data Sets
● 2018 Chronic Absenteeism Data from California Department of Education
○ Counts of total students and chronically absent K-12 students by census tract in Los Angeles County, California, USA
● US Census Bureau American Community Survey 2013-2017
○ Statistics on potential predictors of 2018 chronic absenteeism rates: Median Income, Race, Educational Attainment, Geographic Mobility, Health Insurance Status, English Language Mastery, Employment Status, Length of Commute, Disability, Marital Status, Citizenship and Nativity
Architectural Overview
● Data Sources:
○ US Census Bureau (data.census.gov)
○ California Dept of Education (www.cde.ca.gov)
● Personal Laptop
● Dataplatform.ibm.com object store
● Jupyter Notebooks:
○ CARP-ETL: Join and otherwise prepare data
○ CARP-EXP: Explore and visualize data
○ CARP-DNN: Define, train, and test deep neural network regressor
○ CARP-DTE: Define, train, and test AdaBoost Decision Tree Regressor
○ CARP-ME: Evaluate and compare model performance
Data Quality Assessment
● Determined percentage of rows with missing or non-numeric data in
○ All 11 input data sets prior to cleansing
○ The joined data set
● Primary issue is 84% coverage of LA County census tracts
● Data about 2158 census tracts was available for modeling
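The row-level check described above can be sketched in plain Python. This is a minimal illustration, not the project's notebook code: the helper names, column names, and sample rows are all hypothetical.

```python
def is_numeric(value):
    """Return True if value can be read as a finite number."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    # Reject NaN (never equal to itself) and infinities.
    return number == number and number not in (float("inf"), float("-inf"))


def percent_bad_rows(rows, columns):
    """Percentage of rows with a missing or non-numeric entry in any given column."""
    if not rows:
        return 0.0
    bad = sum(
        1 for row in rows
        if any(not is_numeric(row.get(col)) for col in columns)
    )
    return 100.0 * bad / len(rows)


# Hypothetical sample rows; the column names are illustrative only.
sample = [
    {"tract": "101110", "median_income": "52000", "pct_insured": "88.1"},
    {"tract": "101122", "median_income": "-", "pct_insured": "91.4"},
    {"tract": "101210", "median_income": "47000", "pct_insured": None},
    {"tract": "101220", "median_income": "61000", "pct_insured": "79.9"},
]
print(percent_bad_rows(sample, ["median_income", "pct_insured"]))  # 50.0
```

Running the same function on each input data set and on the joined data set gives comparable percentages before and after cleansing.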
Data Visualization
● Visualized distribution of chronic absenteeism percentages with
○ Histogram
○ Kernel Density Estimate
○ Rug plot
● Distribution is roughly normal, with positive skewness
Data Visualization
● Visualized geographic distribution of chronic absenteeism rates with a choropleth map
● Adjacent or close census tracts tend to have similar rates
Data Exploration
● Visualized relationship of top 15 correlates per r² score using pair plots with regression lines
● Read left to right, then top to bottom
● Race, educational attainment, marital status and income are the strongest predictors
● Prediction strength declines sharply after the top five predictors
Initial Feature Engineering
● Converted counts into percentages in
○ California Department of Education chronic absenteeism data set
○ Nearly all of the American Community Survey data sets
● Joined all twelve data sets by six-digit US census tract number, which is unique within each county
● For deep neural network model only
○ Calculated r² for each predictor/target variable pair
○ Input only the top fifteen predictors
○ Scaled all data to have a mean of zero and a standard deviation of 1
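The neural-network preparation steps can be sketched as follows. This assumes the pairwise r² is the squared Pearson correlation, which matches the r² of a one-variable linear fit; the function names and the tiny data set are hypothetical, and the project itself lists Scikit-Learn for scaling.

```python
import math


def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear signal
    return (sxy * sxy) / (sxx * syy)


def top_predictors(features, target, k):
    """Names of the k features with the highest r² against the target."""
    ranked = sorted(
        features,
        key=lambda name: pearson_r2(features[name], target),
        reverse=True,
    )
    return ranked[:k]


def standardize(xs):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    if std == 0:
        return [0.0 for _ in xs]
    return [(x - mean) / std for x in xs]


# Hypothetical columns; the real inputs are census-tract percentages.
features = {
    "pct_bachelors": [10.0, 20.0, 30.0, 40.0, 50.0],
    "pct_married": [30.0, 10.0, 50.0, 20.0, 40.0],
}
target = [25.0, 20.0, 16.0, 11.0, 5.0]

best = top_predictors(features, target, k=1)
scaled = standardize(features[best[0]])
```

With the real data, k would be 15 and every retained column would be standardized the same way.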
Model Performance Indicator
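Coefficient of Determination (r²)
● Measures the extent to which the predicted rates reflect the total variability of the actual rates
● Disproportionately penalizes large errors
● Prevents errors with opposing signs from cancelling each other out
Formula: r² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)², where yᵢ is an actual rate, ŷᵢ the corresponding predicted rate, and ȳ the mean of the actual rates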
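A minimal sketch of the indicator, computed from the residual and total sums of squares. The sample values are hypothetical; the project itself lists Scikit-Learn for model scoring.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squaring keeps opposite-signed errors from cancelling
    # and weights large errors more heavily.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical chronic absenteeism rates, in percent.
actual = [10.0, 12.0, 14.0, 20.0]
predicted = [11.0, 12.0, 13.0, 19.0]
score = r2_score(actual, predicted)  # 1 - 3/56
```

A perfect prediction scores 1, and always predicting the mean of the actual rates scores 0.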
Core Concept: Neural Network
• Feed forward: predicting by computing the value of each node, layer by layer: multiply the node values in the previous layer by a set of weights, add the products and a bias, and apply a nonlinear activation function
• Back propagation: adjusting the weights and biases by differentiating the loss with respect to them, working backward through the feed-forward computation to find the slope, and applying gradient descent (next slide)
Image from https://victorzhou.com/blog/intro-to-neural-networks/
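The feed-forward step can be sketched for a tiny network in plain Python. The 2-2-1 layout and the hand-picked weights are hypothetical, chosen so the arithmetic is easy to follow; ReLU is used as the activation.

```python
def relu(x):
    """Nonlinear activation: pass positive values, clamp the rest to zero."""
    return x if x > 0 else 0.0


def dense(inputs, weights, biases):
    """One layer: weighted sum of inputs plus bias, then ReLU, for each node."""
    return [
        relu(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def feed_forward(inputs, layers):
    """Propagate inputs through each (weights, biases) layer in order."""
    values = inputs
    for weights, biases in layers:
        values = dense(values, weights, biases)
    return values


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output.
layers = [
    ([[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer
    ([[1.0, 0.5]], [0.25]),                    # output layer
]
print(feed_forward([1.0, 2.0], layers))  # [1.25]
```

The first hidden node computes 0.5 - 2.0 = -1.5 and is clamped to 0; the second computes 1.0 + 2.0 - 1.0 = 2.0; the output is 0.5 * 2.0 + 0.25 = 1.25.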
Core Concept: Gradient Descent for Neural Network Model Training
• Imagine trying to find the deepest valley in a dense fog
• Altitude is the loss function, which compares the predicted value to the actual value
• The size of the step you take is the learning rate
• Descending the valley is gradient descent
• Your location in n-dimensional space is determined by using model weights and the loss function as coordinates
• The model-fitting algorithm repeatedly determines the gradient (slope) of the loss and adjusts model weights and biases to descend the gradient
Text and picture adapted from https://mc.ai/stochastic-gradient-descent-in-plain-english/
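As a minimal sketch of the idea, the loop below fits a one-weight, one-bias model by repeatedly stepping down the gradient of the mean squared error. The data, learning rate, and step count are hypothetical and unrelated to the project's settings.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ~ w * x + b by descending the gradient of the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model: the "altitude" comes from these.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Slope of the loss with respect to each parameter.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Step downhill; the learning rate sets the step size.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
```

After enough steps, w and b approach 2 and 1. A learning rate that is too large would overshoot the valley instead of settling into it.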
Core Concept: CART* for Decision Tree Modeling
Do I play golf today? Yes or No?
…
Build the next leaf using the decision criterion with the lowest Gini score
*Classification and Regression Trees
Adapted from: https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/
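The split-selection rule can be sketched for a classification question like the golf example. The rows below are hypothetical and are not the data set from the cited article. Gini applies to classification; a regression tree would typically score splits by squared error instead.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure group."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, feature, target):
    """Gini score of splitting rows on a categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    return sum(len(g) / len(rows) * gini_impurity(g) for g in groups.values())


def best_feature(rows, features, target):
    """Feature whose split yields the lowest weighted Gini score."""
    return min(features, key=lambda f: weighted_gini(rows, f, target))


# Hypothetical observations.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
]
print(best_feature(rows, ["outlook", "windy"], "play"))  # outlook
```

Splitting on outlook leaves only the rain group mixed (score 1/6), while splitting on windy leaves both groups mixed (score 4/9), so outlook is chosen first.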
Algorithms
Feed-Forward Neural Network
● Dimensions of each layer
○ 15 → 11 → 7 → 1
● ReLU activation for all nodes
○ f(x) = 0 if x <= 0
○ f(x) = x otherwise
● Adam Optimizer
○ 0.01 learning rate
○ 0.001 decay
○ Mean squared error loss metric
● Batch size = 12
● 25-40 epochs (passes through data)
● Early stopping for training runs that don’t converge
● Attributes chosen after wide experimentation
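Two quantities implied by these settings can be worked out directly. The parameter count follows from the layer dimensions, assuming fully connected layers with one bias per node. The learning-rate schedule is an assumption: it treats the 0.001 decay as the time-based decay used by older Keras optimizers, where the rate at step t is the initial rate divided by (1 + decay × t). The function names are hypothetical.

```python
def layer_parameter_counts(dims):
    """Weights plus biases for each pair of consecutive fully connected layers."""
    return [n_in * n_out + n_out for n_in, n_out in zip(dims, dims[1:])]


def decayed_learning_rate(initial_rate, decay, step):
    """Assumed time-based decay: the rate shrinks as 1 / (1 + decay * step)."""
    return initial_rate / (1.0 + decay * step)


dims = [15, 11, 7, 1]
counts = layer_parameter_counts(dims)
print(counts, sum(counts))  # [176, 84, 8] 268

# Under the assumed schedule, the rate halves after 1000 weight updates.
rate_after_1000 = decayed_learning_rate(0.01, 0.001, 1000)
```

A network this small has only 268 trainable parameters, which is modest relative to the roughly two thousand census tracts available for modeling.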
Decision Tree Ensemble
● AdaBoost with decision tree as weak learner
○ Repeatedly builds decision trees, weighting them to improve weakest predictions
○ Squared error loss metric
● Method chosen after experimentation with linear regression and standalone decision tree regression, as well as other decision tree ensembles, such as random forest
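A minimal Scikit-Learn sketch of such an ensemble follows. The synthetic data, tree depth, and number of estimators are assumptions for illustration, not the project's settings; only the AdaBoost-over-decision-trees structure and the squared error loss come from the slide.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the real predictors are census percentages.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(300, 3))
y = 0.2 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(0.0, 1.0, size=300)

X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

# max_depth and n_estimators are illustrative guesses.
model = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=4),  # the weak learner
    n_estimators=50,
    loss="square",                       # squared error loss, as on the slide
    random_state=0,
)
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)    # r² on the held-out rows
```

Each boosting round reweights the training rows so that later trees concentrate on the rows the earlier trees predicted worst.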
Frameworks
All Python-based, listed as purpose: frameworks
● Interactive Development: Jupyter within IBM Watson Studio
● Data Manipulation and Computation: NumPy, Pandas, GeoPandas
● Data Visualization: Matplotlib, Seaborn, Descartes
● Neural Network Modeling: Keras with TensorFlow backend
● Data Scaling, Model Scoring and Decision Tree Ensemble Modeling: Scikit-Learn
Feature Engineering Experiment
● Choropleth map shows adjacent census tracts tend to have similar chronic
absenteeism rates
● Many adjacent or nearby census tracts in LA County appear to have similar
or sequential census tract numbers
● Therefore, would including the six-digit census tract number as a predictor
variable, in addition to its role as a key, improve model performance?
● Unfortunately not. The census tract number is among the weakest of the
predictor variables examined, based on
○ The pairwise r² score
○ No performance change for the Decision Tree Ensemble model
Model Performance Evaluation
● Tested both models with five rounds of randomized shuffle-split with 80/20 train-test ratio
● Chose randomized shuffle-split over cross-validation to retain finer control over size of split
○ Before tuning, got good performance only with 90/10 train-test split
○ Future work could use cross-validation
● The Decision Tree Ensemble is clearly a stronger predictor than the Neural Network
● To avoid outliers, chose iteration with median performance of stronger model for further evaluation
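The evaluation procedure can be sketched in plain Python. The helper names and the example scores are hypothetical; Scikit-Learn offers a comparable ShuffleSplit utility.

```python
import random


def shuffle_split(n_rows, n_splits=5, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_test = round(n_rows * test_fraction)
    for _ in range(n_splits):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def median_iteration(scores):
    """Index of the run whose score is the median (odd number of runs)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[len(order) // 2]


splits = list(shuffle_split(100))                 # five 80/20 splits
example_scores = [0.61, 0.70, 0.66, 0.58, 0.68]   # hypothetical r² per round
print(median_iteration(example_scores))           # 2
```

Unlike k-fold cross-validation, the test sets of different rounds may overlap, but the split ratio can be set freely.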
Model Performance Evaluation
● Compared residuals of predictions using superimposed histograms
● For about 90% of test cases in its median-performing iteration, Decision Tree Ensemble predicted chronic absenteeism rates within 5%
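A share like this can be computed from the residuals as sketched below. The sketch assumes "within 5%" means an absolute residual of at most five percentage points; the function name and sample values are hypothetical.

```python
def fraction_within(actual, predicted, tolerance=5.0):
    """Share of predictions whose absolute residual is at most the tolerance."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(p - a) <= tolerance)
    return hits / len(actual)


actual = [8.0, 12.5, 20.0, 15.0, 30.0]     # hypothetical rates, in percent
predicted = [9.5, 11.0, 27.0, 14.0, 26.0]
print(fraction_within(actual, predicted))  # 0.8
```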
Model Performance Evaluation
● Used a scatter plot with an identity line and +/- 5% lines to visualize deviations of predicted chronic absenteeism rates from actual values
● Result is consistent with residual histograms
Conclusion
● What demographic factors predict the rate of chronic absenteeism?
○ For the years studied, race, educational attainment, marital status and income are the strongest predictors of chronic absenteeism in LA County, California, USA
● Can we use public demographic data to predict chronic absenteeism rates?
○ Yes, for the years studied, well enough to identify areas with highest future risk and, perhaps, implement mitigating measures
● Opportunities for future work
○ Expand scope to include additional geographic areas, years, and predictor variables
○ Further automate data extraction and transformation
○ Consider additional algorithms, e.g. PCA, XGBoost, Polynomial Regression
○ Explore outliers where rate is more or less than expected
○ Explore effects of programs to mitigate chronic absenteeism
● Links
○ https://github.com/IverBand/carp
○ https://www.coursera.org/specializations/advanced-data-science-ibm


Chronic Absenteeism Rate Prediction: A Data Science Case Study

  • 1. © 2019 Iver Band© 2019 Iver Band Chronic Absenteeism Rate Prediction (CARP) Coursera Advanced Data Science with IBM Specialization Capstone Project Iver Band December, 2019
  • 2. © 2019 Iver Band© 2019 Iver Band Agenda ● Use Case ● Source Data Sets ● Architectural Overview ● Data Quality Assessment ● Data Visualization ● Data Exploration ● Initial Feature Engineering ● Model Performance Indicator ● Core Concepts ● Algorithms ● Frameworks ● Feature Engineering Experiment ● Model Performance Evaluation ● Conclusion
  • 3. © 2019 Iver Band © 2019 Iver Band Chronic Absenteeism Rate Prediction (CARP) Use Case ● Chronic absenteeism occurs when a student in grades K-12 misses 10% or more of the school year for any reason ● It is a strong predictor of low academic achievement. ● CARP is for data scientists supporting educational and social services administrators and policymakers in answering the following questions: ○ What demographic factors predict the rate of chronic absenteeism? ○ Can we use demographic data to predict chronic absenteeism rates?
  • 4. © 2019 Iver Band © 2019 Iver Band Source Data Sets ● 2018 Chronic Absenteeism Data from California Department of Education ○ Counts of total students and chronically absent K-12 students by census tract in Los Angeles County, California, USA ● US Census Bureau American Community Survey 2013-2017 ○ Statistics on potential predictors of 2018 chronic absenteeism rates: Median Income Employment Status Race Length of Commute Educational Attainment Disability Geographic Mobility Marital Status Health Insurance Status Citizenship and Nativity English Language Mastery
  • 5. © 2019 Iver Band© 2019 Iver Band data.census.gov www.cde.ca.gov Personal Laptop Dataplatform.ibm.com object store CARP-ETL: Join and otherwise prepare data CARP-EXP: Explore and visualize data CARP-DNN: Define, train, and test deep neural network regressor CARP-DTE: Define, train, and test AdaBoost Decision Tree Regressor CARP-ME: Evaluate and compare model performance US Census Bureau California Dept of Education Data Sources: Jupyter Notebooks: Architectural Overview
  • 6. © 2019 Iver Band© 2019 Iver Band Data Quality Assessment ● Determined percentage of rows with missing or non- numeric data in ○ All 11 input data sets prior to cleansing ○ The joined data set ● Primary issue is 84% coverage of LA County census tracts, ● Data about 2158 census tracts was available for modeling
  • 7. © 2019 Iver Band© 2019 Iver Band ● Visualized distribution of chronic absenteeism percentages with ○ Histogram ○ Kernel Density Estimate ○ Rug plot ● Distribution is roughly normal, with positive skewness Data Visualization
  • 8. © 2019 Iver Band© 2019 Iver Band Data Visualization ● Visualized geographic distribution of chronic absenteeism rates with a choropleth map ● Adjacent or close census tracts tend to have similar rates
  • 9. © 2019 Iver Band© 2019 Iver Band Data Exploration ● Visualized relationship of top 15 correlates per r2 score using pair plots with regression lines ● Read left to right, then top to bottom ● Race, educational attainment, marital status and income are the strongest predictors ● Prediction strength declines sharply after the top five predictors
  • 10. © 2019 Iver Band © 2019 Iver Band Initial Feature Engineering ● Converted counts into percentages in ○ California Department of Education chronic absenteeism data set ○ Nearly all of the American Community Survey data sets ● Joined all twelve data sets by six-digit US census tract number, which is unique with each county ● For deep neural network model only ○ Calculated r2 for each predictor/target variable pair ○ Input only the top fifteen predictors ○ Scaled all data to have a mean of zero and a standard deviation of 1
  • 11. © 2019 Iver Band© 2019 Iver Band Model Performance Indicator Coefficient of Determination (r2 ) ● Measures the extent to which the predicted rates reflect the total variability of the actual rates ● Disproportionately penalizes large errors ● Prevents errors with opposing signs from cancelling each other out Formula
  • 12. © 2019 Iver Band© 2019 Iver Band Core Concept: Neural Network Feed Forward Back Propagation • Feed forward: prediction by computing the value of each node, layer-by-layer, based on multiplying the nodes in the previous layer by a set of weights, adding the products and a bias, and applying a nonlinear activation function • Back propagation: adjusting the weights and bias by differentiating the feed forward function to determine its slope, and applying gradient descent (next slide). Image from https://victorzhou.com/blog/intro-to-neural-networks/
  • 13. © 2019 Iver Band© 2019 Iver Band Core Concept: Gradient Descent for Neural Network Model Training • Imagine trying to find the deepest valley in a deep fog • Altitude is loss function, which compares the predicted value to the actual value • The size of the step you take is the learning rate • Descending the valley is gradient descent • Your location in n-dimensional space is determined by using model weights and the loss function as coordinates • The model-fitting algorithm repeatedly determines the gradient (slope) of the loss and adjusts model weights and biases to descend the gradient Text and picture adapted from https://mc.ai/stochastic-gradient-descent-in-plain-english/
  • 14. © 2019 Iver Band© 2019 Iver Band Core Concept: CART* for Decision Tree Modeling Do I play golf today? Yes or No? … Build next leaf with decision criteria with lowest Gini score: Adapted from: https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/*Classification and Regression Trees
  • 15. Algorithms
    Feed-Forward Neural Network
    ● Dimensions of each layer
      ○ 15 → 11 → 7 → 1
    ● ReLU activation for all nodes
      ○ f(x) = 0 if x ≤ 0
      ○ f(x) = x otherwise
    ● Adam optimizer
      ○ 0.01 learning rate
      ○ 0.001 decay
      ○ Mean squared error loss metric
    ● Batch size = 12
    ● 25-40 epochs (passes through data)
    ● Early stopping for training runs that don't converge
    ● Attributes chosen after wide experimentation
    Decision Tree Ensemble
    ● AdaBoost with decision tree as weak learner
      ○ Repeatedly builds decision trees, weighting them to improve weakest predictions
      ○ Squared error loss metric
    ● Method chosen after experimentation with linear regression and standalone decision tree regression, as well as other decision tree ensembles, such as random forest
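The decision tree ensemble can be sketched with Scikit-Learn's `AdaBoostRegressor`. Only the weak learner type and the squared error loss come from the slide; the tree depth, number of estimators, and synthetic data below are assumptions chosen for illustration, not the project's tuned settings.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the tract-level predictors and absenteeism rates.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(300, 3))
y = 0.2 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 1, size=300)

# AdaBoost with a decision tree as the weak learner and squared error loss.
# max_depth and n_estimators are assumed values.
model = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=4),
    n_estimators=50,
    loss="square",
    random_state=0,
)
model.fit(X[:240], y[:240])

# score() returns the coefficient of determination on the held-out rows.
score = model.score(X[240:], y[240:])
```

The weak learner is passed positionally because its keyword name differs between Scikit-Learn releases.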
  • 16. Frameworks (all Python-based)
    Purpose | Frameworks
    Interactive Development | Jupyter within IBM Watson Studio
    Data Manipulation and Computation | NumPy, pandas, GeoPandas
    Data Visualization | Matplotlib, Seaborn, Descartes
    Neural Network Modeling | Keras with TensorFlow backend
    Data Scaling, Model Scoring and Decision Tree Ensemble Modeling | Scikit-Learn
  • 17. Feature Engineering Experiment
    ● Choropleth map shows adjacent census tracts tend to have similar chronic absenteeism rates
    ● Many adjacent or nearby census tracts in LA County appear to have similar or sequential census tract numbers
    ● Therefore, would including the six-digit census tract number as a predictor variable, in addition to its role as a key, improve model performance?
    ● Unfortunately not. The census tract number is among the weakest of the predictor variables examined, based on
      ○ The pairwise r² score
      ○ No performance change for the Decision Tree Ensemble model
  • 18. Model Performance Evaluation
    ● Tested both models with five rounds of randomized shuffle-split with an 80/20 train-test ratio
    ● Chose randomized shuffle-split over cross-validation to retain finer control over the size of the split
      ○ Before tuning, got good performance only with a 90/10 train-test split
      ○ Future work could use cross-validation
    ● The Decision Tree Ensemble is clearly a stronger predictor than the Neural Network
    ● To avoid outliers, chose the iteration with the median performance of the stronger model for further evaluation
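The evaluation procedure can be sketched with Scikit-Learn's `ShuffleSplit`. This is a minimal illustration, not the project's notebook code: the helper name `shuffle_split_scores`, the synthetic data, and the linear regression stand-in model are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit

def shuffle_split_scores(model, X, y, rounds=5, test_size=0.2, seed=0):
    """Return one r2 score per randomized train-test split."""
    splitter = ShuffleSplit(n_splits=rounds, test_size=test_size, random_state=seed)
    scores = []
    for train_idx, test_idx in splitter.split(X):
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return scores

# Synthetic data; the real inputs would be the joined tract-level table.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.5, size=200)

scores = shuffle_split_scores(LinearRegression(), X, y)

# Index of the round with the median score, used for further evaluation.
median_round = int(np.argsort(scores)[len(scores) // 2])
```

Changing `test_size` to 0.1 gives the 90/10 split mentioned on the slide without changing the number of rounds, which is the control that k-fold cross-validation does not offer directly.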
  • 19. Model Performance Evaluation
    ● Compared residuals of predictions using superimposed histograms
    ● For about 90% of test cases in its median-performing iteration, the Decision Tree Ensemble predicted chronic absenteeism rates within 5%
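The "within 5%" figure can be computed directly from the residuals. The sketch below assumes that 5% means five percentage points of the absenteeism rate; the helper name and the sample values are made up.

```python
import numpy as np

def share_within(actual, predicted, tolerance=5.0):
    """Fraction of predictions whose residual is within +/- tolerance points."""
    residuals = np.asarray(predicted) - np.asarray(actual)
    return float(np.mean(np.abs(residuals) <= tolerance))

# Hypothetical actual and predicted chronic absenteeism rates, in percent.
actual = np.array([10.0, 12.0, 8.0, 20.0, 15.0])
predicted = np.array([11.0, 18.0, 7.5, 16.0, 15.0])

share = share_within(actual, predicted)
```

Here four of the five residuals (1, 6, −0.5, −4, 0) fall inside the ±5 band, so `share` is 0.8.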
  • 20. Model Performance Evaluation
    ● Used a scatter plot with an identity line and ±5% lines to visualize deviations of predicted chronic absenteeism rates from actual values
    ● Result is consistent with the residual histograms
  • 21. Conclusion
    ● What demographic factors predict the rate of chronic absenteeism?
      ○ For the years studied, race, educational attainment, marital status and income are the strongest predictors of chronic absenteeism in LA County, California, USA
    ● Can we use public demographic data to predict chronic absenteeism rates?
      ○ Yes, for the years studied, well enough to identify areas with highest future risk and, perhaps, implement mitigating measures
    ● Opportunities for future work
      ○ Expand scope to include additional geographic areas, years, and predictor variables
      ○ Further automate data extraction and transformation
      ○ Consider additional algorithms, e.g. PCA, XGBoost, Polynomial Regression
      ○ Explore outliers where rate is more or less than expected
      ○ Explore effects of programs to mitigate chronic absenteeism
    ● Links
      ○ https://github.com/IverBand/carp
      ○ https://www.coursera.org/specializations/advanced-data-science-ibm