At the University of Calgary, we used real-time data on student applications to provide Enrolment Services with better predictive analytics on the students who were offered a place at the University. IR offices are well placed to leverage institutional data to make these predictions. Our knowledge of the data and analytical tools can make us leaders in predictive analytics at our institutions. This presentation will discuss the issues involved in developing models, selecting the best one, and putting it to use. These lessons apply to many other situations.
CIRPA 2016: Individual Level Predictive Analytics for Improving Student Enrolment
1. Individual Level Predictive Analytics
Improving Student Enrolment Outcomes
Stephen Childs, Institutional Analyst @sechilds
CIRPA/PNAIRP 2016, Kelowna, BC
November 7, 2016
Office of Institutional Analysis
2. Why Predictive Analytics and IR
Higher Education Institutions collect more data
IR offices have experts in institutional data
IR offices are seeking ways to add more value
Machine learning and predictive models are in the news
3. Opportunity… or Crisis?
Predictive analytics require a different skill set
A different set of software tools is required
You may be the only analyst working on this in your office
Requesters expect you to be the expert
Resistance to implementing insights from predictive analytics
4. The way forward
Add these skills to your IR toolkit
Find tools that work with your existing ones
Develop your understanding and expertise
Community of Practice
5. Learning Outcomes
Have a high-level understanding of what predictive analytics
does and how it works.
Have a concrete series of steps to follow.
Know the vocabulary of machine learning and statistical
modeling.
Know what tools can be used for this – and how they work
with existing tools
Know how we select, train, and test models for prediction
Learn some of the challenges in predictive modeling
6. Outline
Introduction (already done??)
Introduction to Machine Learning
Model Building Steps
Tool Overview
Customer Education
Challenges
Building Community
10. STEP 1: Define Your Goal
Sets the scope of your analysis
Provides input into model selection
Identifies stakeholders
Discover what data is available
Revise as the project progresses
11. STEP 2: Get Access to your Data
Three different types of data:
—Operational SIS
—Data Warehouse – snapshots
—Predictive Analytics Data
Talk to your DBA to find the relevant tables
Think of other data to add:
—Residence, CRM
—Socio-economic data
12. STEP 3: Build an Analysis File
Extract – Transform – Load
—Use as much existing ETL as you can
—Join tables together
—Work with a programmer – but analyst drives
Hard to capture the timeline of the application
—When did they apply?
—When were they accepted?
—When did they register?
14. STEP 3: Build a Data Analysis File – Best Practices
Test your ETL process (automated is better)
Save your data in a database (existing one, SQLite)
Append timestamped rows to the table and use a test indicator
Keep track of program version
Keep a changelog
Capture more data, then filter that for analysis
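The append-and-timestamp practice above can be sketched with Python's standard-library sqlite3 module. The database, table, and column names here are hypothetical:

```python
import sqlite3
from datetime import datetime, timezone

# In-memory database for illustration; point this at a file in practice.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS applications (
        applicant_id TEXT,
        status       TEXT,
        snapshot_ts  TEXT,    -- when this row was captured
        is_test      INTEGER  -- 1 for test-run rows, 0 for production
    )
""")

def append_snapshot(rows, is_test=False):
    """Append timestamped rows rather than overwriting history."""
    ts = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO applications VALUES (?, ?, ?, ?)",
        [(r["applicant_id"], r["status"], ts, int(is_test)) for r in rows],
    )
    conn.commit()

append_snapshot([{"applicant_id": "A001", "status": "admitted"}], is_test=True)
```

Because rows are appended rather than updated, earlier snapshots remain queryable – which is what makes daily snapshots possible later.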
15. STEP 4: Develop a model
[Diagram: student characteristics (independent variables, or features) feed into a function / algorithm / formula that produces predicted outcomes (the dependent variable).]
16. STEP 4: Develop a Model – Things to Watch Out For
Missing data
Multiple models
Model testing
17. STEP 4: Develop a Model - Accuracy
Refer back to your goal – no universal measure of accuracy
Model used for decision making/resource allocation
Assign loss based on incorrect predictions – minimize it
Receiver Operating Characteristic (ROC) and Area Under the
Curve (AUC)
Bias-Variance Trade Off and Overfitting
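To make AUC concrete with toy numbers (a pure-Python sketch, not tied to any modeling library): AUC is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted registration probabilities vs. actual outcomes.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))  # → 0.8888888888888888 (8 of 9 pairs ranked correctly)
```

An AUC of 0.5 is coin-flipping; 1.0 is perfect ranking – which is why it is a useful single-number summary when you have not yet committed to a decision threshold.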
18. STEP 5: Deliver Your Results
Set up delivery early
Meet with your audience – set expectations
How will the data be used – refer back to goal
Dashboards
Data files
19. STEP 5: Delivery to Students
Information has to be presented to students carefully
—Present a positive outlook
—Don’t personalize it – talk about a group of similar
students.
The factors in the model may matter less than unobserved factors.
Difference between causality and correlation.
Beware the self-fulfilling prophecy
21. Weapons of Math Destruction
Three factors make a model a WMD:
—Is the participant aware of the model? Is the model
opaque or invisible?
—Does the model work against the participant’s interest? Is
it unfair? Does it create feedback loops?
—Can the model scale?
22. Experience So Far
Getting the data took longer than anticipated
Working with the data was a great learning experience
Automated process for harvesting data
Starting to work on the delivery end
24. Community of Practice
Predictive Analytics Roundtable
Mailing List – more discussion in future
http://mailman.ucalgary.ca/mailman/listinfo/predictive-l
Stephen.Childs@ucalgary.ca
@sechilds #CIRPA2016
PyData, other user groups
Editor's Notes
Different skills – data needs and setup are different. Predictive analytics are very different from reporting.
Terminology often comes from machine learning/computer science – IR more grounded in traditional statistics
Those requesting predictive analytics may not know much about what they want. (A familiar story in IR.) But you need to be the expert – they expect that.
Add predictive analytics and machine learning to the toolkit
Find software tools that work with existing tools
Learn the vocabulary around this discipline
More understanding of machine learning, statistics and other stuff
Don’t work alone – develop a community of practice.
Analyst and Researcher - MA Economics from WLU
History with EPRI and uCalgary
Analyst role and technical skills vs. programmer/analyst
Machine learning comes out of computer science – different tradition and terminology
It is related to artificial intelligence
It is widely used in the tech world – it impacts your life on a daily basis
Statistics and Machine Learning - the two cultures paper & response
Statistics - assume a Data Generation Process and want to learn about that
Machine Learning - Algorithms applied to data - no such assumption
Goals are different
Types of Machine Learning
Supervised vs. Unsupervised Learning
Classification vs. Regression
Machine Learning Algorithms
OLS
Logistic Regression
Decision Trees - Random Forest
Ensemble Models
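Supervised classification in miniature: hold out a test set, fit a model on the training portion, and score it on data it has not seen. This pure-Python sketch uses a deliberately trivial one-feature threshold model and hypothetical (admission average, registered) pairs – a stand-in for the real algorithms listed above:

```python
import random

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle and split (feature, label) pairs into train and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_threshold(train):
    """A trivial classifier: predict 1 when the feature exceeds a
    threshold; pick the threshold that maximizes training accuracy."""
    candidates = sorted({x for x, _ in train})
    return max(candidates,
               key=lambda t: sum((x > t) == y for x, y in train))

# Hypothetical data: (admission average, registered?) pairs.
data = [(60, 0), (65, 0), (70, 0), (75, 1), (80, 1),
        (85, 1), (90, 1), (62, 0), (88, 1), (72, 0)]
train, test = train_test_split(data)
threshold = fit_threshold(train)
test_accuracy = sum((x > threshold) == y for x, y in test) / len(test)
```

Swapping the threshold model for logistic regression or a random forest changes only the fitting step; the split/fit/score loop stays the same.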
Define your goal
Get access to data
Build an analysis file
Develop a model
Deliver your results
It is best to start with a written document describing the goals of your project.
Otherwise you are really starting with an “unwritten” one – and that can cause confusion later.
e.g. Canadian and American constitutions vs. British
The operational SIS data is the source. You perform an ETL process to get the snapshot.
* How many people use snapshots?
* Of those, are the snapshot fields different from the SIS fields?
You will probably need to generate your own ETL process for the predictive analytics data.
If you want to do real time predictions – you need access to that operational data.
There may be a view already in place for you - that does most of what you want.
Ask – Raise your hand if you are comfortable joining database tables together – or any type of tables
Keep your hands up – also raise your hands if you have someone close by who can help you with that. Your job as the analyst is to keep your eye on the goal.
Working with a programmer lets you do pair programming. (Which is great if you can do it.)
Students can change their minds throughout the process – the university can reject them from a program.
Figure out the significant events in the person’s record and capture that time stamp. We also found out that the effective dates are not always WHEN something was added into the database!
Option 1 – use a programming language – Python, R, SQL
Option 2 – use a graphical data blending tool (for prototyping or the whole project)
Graphical tools are better for prototyping – getting you started quickly.
Code is the best – easier to maintain, easier to track changes, handles complexity better, but higher barrier to entry!!
There are a number of ETL “moves” you can learn – they work regardless of the tool – and will be very useful in talking with programmers.
Focus on learning the moves, not the syntax. Draw diagrams!
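One of those moves is the left join: keep every row from the main table and attach matching columns from a secondary one. A pure-Python sketch with hypothetical application and residence tables (in practice you would do this in SQL or with a data-frame merge):

```python
applications = [
    {"applicant_id": "A001", "program": "Arts"},
    {"applicant_id": "A002", "program": "Science"},
]
residence = [
    {"applicant_id": "A001", "building": "Rundle Hall"},
]

def left_join(left, right, key):
    """Keep every left row; attach matching right columns where present."""
    lookup = {row[key]: row for row in right}
    return [{**l, **lookup.get(l[key], {})} for l in left]

joined = left_join(applications, residence, "applicant_id")
# A001 gains a "building" column; A002 keeps only its application fields.
```

The same move, drawn as a diagram, reads identically whether the tool is Python, SQL, or a graphical blender.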
Testing – you need to make sure your ETL is good – if you can automate this testing… you are ahead of the curve
Compare to individual records – see if your file makes sense
Modularize data transformations – so you can test with fake data that covers likely cases
Databases are awesome – use an existing one or set up a simple one (SQLite) – you have this expertise at your institution!!
This lets you start creating DAILY snapshots – which will come in handy next year! Think about the table structure – talk to your colleagues!!
Program version – git hash, version number (semantic versioning)
What is a model? At its core, it is a way to relate the characteristics of students to their outcomes.
You can think of it as a formula that takes the data that you have – and modifies it to produce the outcome you want.
There are a number of different types of models, but most programs will give you the same interface to all types.
The important thing is understanding what the algorithm is doing – and how to “tune” it.
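A minimal sketch of that common interface, loosely modelled on the fit/predict convention scikit-learn popularized (both models here are toys, not real algorithms):

```python
class MajorityClassModel:
    """Baseline: always predict the most common training outcome."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self
    def predict(self, X):
        return [self.label for _ in X]

class ThresholdModel:
    """Predict 1 when the first feature exceeds a fixed threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def fit(self, X, y):
        return self  # nothing to learn in this toy version
    def predict(self, X):
        return [int(row[0] > self.threshold) for row in X]

X = [[65], [80], [90]]
y = [0, 1, 1]
for model in (MajorityClassModel(), ThresholdModel(70)):
    model.fit(X, y)
    print(model.predict(X))
```

Because both expose the same fit/predict pair, swapping algorithms means changing one line – which is what makes trying multiple model types cheap.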
Missing data - grades example, geographic data, gender example.
Confusion matrix!
“Making no assumptions” about costs means you are assuming a false positive is as bad as a false negative.
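That asymmetry can be made concrete: build the confusion matrix, then weight false positives and false negatives differently. The 5:1 weighting below is purely illustrative:

```python
def confusion_matrix(actual, predicted):
    """Counts of (true pos, false pos, false neg, true neg)."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

def weighted_loss(actual, predicted, fp_cost=1.0, fn_cost=5.0):
    """Total cost when a false negative (e.g. missing a student who
    needs outreach) is treated as costlier than a false positive."""
    _, fp, fn, _ = confusion_matrix(actual, predicted)
    return fp * fp_cost + fn * fn_cost

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_matrix(actual, predicted))  # → (2, 1, 1, 2)
print(weighted_loss(actual, predicted))     # → 6.0
```

Picking the decision threshold that minimizes this weighted loss ties the accuracy question back to the project goal, rather than to a generic error rate.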
Our models should never serve as a gatekeeper to services or access to education – the only case where that happens is an experiment – and you need to get REB approval for that.