Apache Spark Use case for Education Industry

© 2016 IBM Corporation
Academic Alert System
Presenter: Vinayak Agrawal
Vagrawal@us.ibm.com

Agenda
▪ Use Case
▪ Use Case Architecture/Work Flow in Weka
▪ Data Volume
▪ Problem Statement
▪ Our Analytical Platform
▪ Spark Workflow
▪ Result Comparison between Weka and Spark
▪ Spark Challenges
▪ Q&A

Use Case: Academic Alert System
▪ Academic institutions receive performance-based funding on parameters* such as
  ▪ Student Retention – Retention Rates
  ▪ Student Graduation – Completion Rates
▪ Academic institutions want to be proactive in giving academic feedback to students BEFORE they take the final exam.
*Source: http://www.ncsl.org/research/education/performance-funding.aspx
Develop an ML model that can predict at-risk students (those who might fail) and provide this feedback to students and professors so that they can take appropriate action.

Use Case: Academic Alert System in Weka

Data Volume (in Prod)
Learning Management Systems
1) Student activity data
   Total = ~350 million records
   Research = 15-18 million records
2) Student gradebook data
   Total = ~1.5 million
   Research = 100,000 per semester
Student Information Systems
1) Demographics
   Research = 5500 students per semester x 3
2) Enrollment
   Research = 27000 per semester x 3
3) Course
   Research = ~2000 per semester x 3

Problem Statement
Small universities have fewer students, so Weka might work.
Weka: it may be impossible to train models from large datasets using the Weka Explorer graphical user interface, even when the Java heap size has already been increased, because the Explorer always loads the entire dataset into the computer's main memory.
We need to scale out for larger universities.
How do I process 45,000 students with 20 features?

Analytical Platform
▪ Hardware:
  ▪ 3 virtual machines on IBM PureFlex
    • 8 cores per VM
    • 32 GB RAM, 100 GB per VM
▪ Software:
  ▪ 3-node Hadoop cluster
    • Spark 1.5.2: Zeppelin, Python, Scala
    • Oozie, Hive and Sqoop

Spark Work Flow
Data is split into Training and Test sets:
Training → Sampling → Imputation → Train_Data → Fit → Model
Test → Imputation → Test_Data → Transform (with the fitted Model) → Predictions

What does our Data Look like?
▪ Data sources: derived from the ETL stage
▪ 19 features from the Learning Management System & student demographics
▪ Count:
  Training: 9923
  Testing: 5145

Sampling
TRAINING DATA
Before sampling:
Label   Count
0.0     9267
1.0     656
After sampling:
Label   Count
0.0     9267
1.0     9184
1.0 = Student at risk
The training data was skewed, with only 656 at-risk students, so we duplicated the at-risk rows.
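The duplication step can be sketched in plain Python as follows. This is only an illustration under assumptions: the slide does not say how the rows were duplicated, and the factor of 14 is inferred from the counts above (656 × 14 = 9184). The `oversample` helper and the synthetic rows are hypothetical.

```python
from collections import Counter


def oversample(rows, label_key, minority_label, factor):
    """Return rows in which every minority-class row appears `factor` times."""
    result = []
    for row in rows:
        copies = factor if row[label_key] == minority_label else 1
        result.extend([row] * copies)
    return result


# Synthetic stand-in for the training set: only the label column matters here.
training = [{"label": 0.0}] * 9267 + [{"label": 1.0}] * 656

# factor=14 is inferred from the slide's counts (656 * 14 = 9184), not stated in it.
balanced = oversample(training, "label", 1.0, factor=14)
counts = Counter(row["label"] for row in balanced)
```

Only the training data is resampled here; the test set keeps its original class balance so that the reported metrics reflect real-world proportions.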

Imputation
▪ Fill numerical columns with the mean value
  ▪ Age
  ▪ SAT scores
▪ Fill categorical columns with the mode value
  ▪ Enrollment Status
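A minimal plain-Python sketch of this imputation rule is shown below. The `impute` helper, the column names and the sample records are all hypothetical and serve only to illustrate mean filling for numeric columns and mode filling for categorical ones; the slides do not show the actual implementation.

```python
from statistics import mean, mode


def impute(rows, numeric_cols, categorical_cols):
    """Fill None values: column mean for numeric columns, column mode for categorical ones."""
    fills = {}
    for col in numeric_cols:
        fills[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        fills[col] = mode(r[col] for r in rows if r[col] is not None)
    return [
        {col: (fills[col] if val is None and col in fills else val) for col, val in r.items()}
        for r in rows
    ]


# Hypothetical records; column names and values are made up for illustration.
students = [
    {"age": 18, "sat_score": 1100, "enrollment_status": "full-time"},
    {"age": None, "sat_score": 1300, "enrollment_status": "full-time"},
    {"age": 22, "sat_score": None, "enrollment_status": None},
]

filled = impute(students, ["age", "sat_score"], ["enrollment_status"])
```

On a Spark DataFrame the same idea would typically be expressed by computing the fill values first and then passing them as a dictionary to `df.na.fill(...)`.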

Modelling Using Spark ML Package
Why? The Spark ML package works on DataFrames.
Build the pipeline: String Indexer for categorical variables → Vector Assembler → Model
Use the model.
4 Lines of Code
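# Setup: imports; the *_indexer and assembler stages are assumed to be defined beforehand.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=100, regParam=0.01)
pipeline_lr = Pipeline(stages=[ONLINE_FLAG_indexer, RC_GENDER_indexer,
    RC_ENROLLMENT_STATUS_indexer, RC_CLASS_CODE_indexer, STANDING_indexer,
    assembler, lr])
model_lr = pipeline_lr.fit(trainData)         # fit every stage on the training data
prediction_lr = model_lr.transform(testData)  # apply the fitted stages to the test data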
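To make the two preparation stages in this pipeline concrete, here is a plain-Python sketch of what a string indexer and a vector assembler do conceptually. It is not Spark code: `fit_string_index` and `assemble` are hypothetical helpers, the rows are invented, and the alphabetical tie-break is an assumption added to keep the example deterministic.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index, most frequent category first (index 0.0)."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def assemble(row, columns):
    """Concatenate the chosen columns of one row into a single feature list."""
    return [float(row[c]) for c in columns]


# Hypothetical rows; the column names only imitate the ones in the slide.
rows = [
    {"RC_GENDER": "F", "ONLINE_FLAG": "N", "AGE": 19},
    {"RC_GENDER": "M", "ONLINE_FLAG": "N", "AGE": 22},
    {"RC_GENDER": "F", "ONLINE_FLAG": "Y", "AGE": 20},
]

gender_index = fit_string_index(r["RC_GENDER"] for r in rows)
online_index = fit_string_index(r["ONLINE_FLAG"] for r in rows)

features = []
for r in rows:
    indexed = {
        "RC_GENDER_idx": gender_index[r["RC_GENDER"]],
        "ONLINE_FLAG_idx": online_index[r["ONLINE_FLAG"]],
        "AGE": r["AGE"],
    }
    features.append(assemble(indexed, ["RC_GENDER_idx", "ONLINE_FLAG_idx", "AGE"]))
```

In the pipeline, the index mappings are learned during `fit` on the training data and then reused unchanged during `transform` on the test data, which is why the indexers are pipeline stages and not a separate preprocessing script.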

Logistic Regression Results
Spark (test data count: 5145, 19 features):
            Predicted 0   Predicted 1
Actual 0    4065          720
Actual 1    51            309
309 students at risk
85.01 % accuracy
85.83 % recall
Time: 20 seconds
Weka (test data count: 5145, 19 features):
            Predicted 0   Predicted 1
Actual 0    4093          692
Actual 1    49            311
85.6 % accuracy
86.4 % recall
Time: 49 seconds
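The accuracy and recall figures follow directly from the confusion matrices. The short sketch below recomputes them, assuming rows are actual classes, columns are predicted classes, and class 1 is "at risk"; the helper name `accuracy_and_recall` is hypothetical.

```python
def accuracy_and_recall(matrix):
    """matrix[actual][predicted] holds counts; class 1 is 'at risk'."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total   # share of all students classified correctly
    recall = tp / (tp + fn)        # share of at-risk students that were caught
    return accuracy, recall


# Logistic regression confusion matrices from the slide.
spark_lr = [[4065, 720], [51, 309]]
weka_lr = [[4093, 692], [49, 311]]

for name, matrix in [("Spark", spark_lr), ("Weka", weka_lr)]:
    acc, rec = accuracy_and_recall(matrix)
    print(f"{name}: accuracy={acc:.2%}, recall={rec:.2%}")
```

Recall is the more important number for an alert system, since a missed at-risk student (a false negative) never receives feedback.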

Random Forest Comparison
Spark (data count: 5145, 19 features):
            Predicted 0   Predicted 1
Actual 0    4065          720
Actual 1    51            309
85.01 % accuracy
85.83 % recall
Time: 16 seconds
Weka (data count: 5145, 19 features):
            Predicted 0   Predicted 1
Actual 0    4186          599
Actual 1    83            277
86.7 % accuracy
76.9 % recall
Time: 30 seconds

Naive Bayes Comparison
Spark (data count: 5145, 19 features):
            Predicted 0   Predicted 1
Actual 0    4279          506
Actual 1    158           202
87.1 % accuracy
56.1 % recall
Time: 9 seconds
Weka (data count: 5145, 19 features):
            Predicted 0   Predicted 1
Actual 0    4093          692
Actual 1    67            293
85.2 % accuracy
81.4 % recall
Time: 30 seconds

Why is this Better?
Training → Sampling → Imputation → Train_Data → Fit → Model
Test → Imputation → Test_Data → Transform → Predictions
• Complete workflow in one environment: Zeppelin on Spark
• Java/Scala or Python to choose from
• Distributed computing

Spark Challenges
▪ No Python support for saving and loading a pipeline model yet
  • SPARK-6725, SPARK-13032
▪ ML StringIndexer does not protect itself from column name duplication
  • SPARK-12874
▪ PySpark CrossValidatorModel does not support avgMetrics
  • SPARK-12810
  • You have to create an RDD and then extract the metrics
▪ PMML export not supported yet
  • SPARK-11171

LOGISTIC REGRESSION MODEL

Random Forest Code

Naïve Bayes Code

Appendix

IBM Open Platform for Apache Hadoop (IOP)
▪ Includes Spark
▪ 100% open source
▪ Implement with help from IBM Lab Services
▪ Production support offering available
Apache open source components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene

Questions?
