© 2016 IBM Corporation1
Academic Alert System
Presenter: Vinayak Agrawal
Vagrawal@us.ibm.com
Agenda
• Use Case
• Use Case Architecture/Work Flow in Weka
• Data Volume
• Problem Statement
• Our Analytical Platform
• Spark Workflow
• Result Comparison between Weka and Spark
• Spark Challenges
• Q&A
Use Case: Academic Alert System
• Academic institutions get performance-based funding on parameters* like
  • Student Retention – Retention Rates
  • Student Graduating – Completion Rates
• Academic institutions want to be proactive in providing academic feedback to students BEFORE they appear in the final exam.
*Source: http://www.ncsl.org/research/education/performance-funding.aspx
Develop an ML model that can predict at-risk students (those who might fail) and provide this feedback to students and professors so that they can take appropriate action.
Use Case: Academic Alert System in Weka
Data Volume (in Prod)
Learning Management Systems
  1) Student Activity data
     Total = ~350 million records
     Research = 15-18 million records
  2) Student Gradebook data
     Total = ~1.5 million
     Research = 100,000 per semester
Student Information Systems
  1) Demographics
     Research = 5500 students per semester x 3
  2) Enrollment
     Research = 27000 per semester x 3
  3) Course
     Research = ~2000 per semester x 3
Problem Statement
• Small universities have fewer students, so Weka might work.
• Weka: it may be impossible to train models from large datasets using the Weka Explorer graphical user interface, even when the Java heap size has already been increased, because the Explorer always loads the entire dataset into the computer's main memory.
• To scale out for larger universities: how do I process 45000 students with 20 features?
Analytical Platform
• Hardware:
  • 3 Virtual Machines on IBM PureFlex
    • 8 cores per VM
    • 32 GB RAM, 100 GB per VM
• Software:
  • 3 node Hadoop cluster
    • Spark 1.5.2: Zeppelin, Python, Scala
    • Oozie, Hive and Sqoop
Spark Work Flow
[Workflow diagram]
• Training path: Data → Training → Sampling → Imputation → Train_Data → Fit → Model
• Test path: Data → Test → Imputation → Test_Data → Transform (using Model) → Predictions
What does our Data Look like?
• Data sources: derived from ETL stage
• 19 features from Learning Management System & Student Demographics
• Count:
  • Training: 9923
  • Testing: 5145
Sampling (TRAINING DATA)
1.0 = Student At Risk
Before duplication:
  Label   Count
  0.0     9267
  1.0     656
After duplication:
  Label   Count
  0.0     9267
  1.0     9184
Training data was skewed, with only 656 at-risk students, so we duplicated the at-risk rows (656 × 14 = 9184).
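The duplication step on this slide can be sketched in plain Python. This is a minimal illustration, not the presenter's code: the helper name `oversample_minority` and the row layout (dicts with a `label` key) are assumptions. It repeats each at-risk row a whole number of times, picking the largest factor that keeps the minority class no larger than the majority class; for the counts on the slide that factor works out to 14.

```python
from collections import Counter


def oversample_minority(rows, label_key="label", minority=1.0):
    """Duplicate minority-class rows by a whole-number factor.

    Hypothetical helper for illustration: the factor is the largest integer
    that keeps the minority class no larger than the majority class.
    """
    minority_rows = [r for r in rows if r[label_key] == minority]
    majority_rows = [r for r in rows if r[label_key] != minority]
    if not minority_rows:
        return list(rows), 1
    factor = max(1, len(majority_rows) // len(minority_rows))
    # Keep every majority row once and every minority row `factor` times.
    balanced = majority_rows + minority_rows * factor
    return balanced, factor


# Synthetic rows with the label counts from the slide (no real student data).
train = [{"label": 0.0}] * 9267 + [{"label": 1.0}] * 656
balanced, factor = oversample_minority(train)
counts = Counter(r["label"] for r in balanced)
print(factor, counts[0.0], counts[1.0])
```

As on the slide, the duplication applies to the training data only; the test data keeps its original class balance.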
Imputation
• Filling with mean value for numerical columns
  • Age
  • SAT scores
• Filling with mode value for categorical columns
  • Enrollment Status
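A small plain-Python sketch of mean and mode imputation follows. It is an illustration under assumptions, not the code used in the talk: the helper names, the sample rows, and the column names `AGE` and `SAT_SCORE` are hypothetical (`RC_ENROLLMENT_STATUS` is borrowed from the pipeline code in the deck). The sketch computes the fill values once and then applies them, which makes it possible to reuse training-set values on the test set; the slides do not say whether that was done.

```python
from statistics import mean, mode


def fit_fill_values(rows, numeric_cols, categorical_cols):
    """Compute one fill value per column from the non-missing entries."""
    fill = {}
    for col in numeric_cols:
        fill[col] = mean([r[col] for r in rows if r[col] is not None])
    for col in categorical_cols:
        fill[col] = mode([r[col] for r in rows if r[col] is not None])
    return fill


def apply_fill(rows, fill):
    """Return copies of the rows with None replaced by the fill value."""
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical sample rows; None marks a missing value.
rows = [
    {"AGE": 19, "SAT_SCORE": 1100, "RC_ENROLLMENT_STATUS": "Full-time"},
    {"AGE": None, "SAT_SCORE": 1300, "RC_ENROLLMENT_STATUS": "Full-time"},
    {"AGE": 23, "SAT_SCORE": None, "RC_ENROLLMENT_STATUS": None},
    {"AGE": 21, "SAT_SCORE": 1200, "RC_ENROLLMENT_STATUS": "Part-time"},
]

fill = fit_fill_values(rows, ["AGE", "SAT_SCORE"], ["RC_ENROLLMENT_STATUS"])
filled = apply_fill(rows, fill)
print(fill)
```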
Modelling Using Spark ML Package
Why? DataFrame
[Diagram labels: String Indexer for Categorical Variables; Vector Assembler; Build the Pipeline; Model; Use Model]
4 Lines of Code
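from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

# The *_indexer stages, assembler, trainData and testData are defined elsewhere (not shown on the slide).
lr = LogisticRegression(maxIter=100, regParam=0.01)
pipeline_lr = Pipeline(stages=[ONLINE_FLAG_indexer, RC_GENDER_indexer,
                               RC_ENROLLMENT_STATUS_indexer, RC_CLASS_CODE_indexer,
                               STANDING_indexer, assembler, lr])
model_lr = pipeline_lr.fit(trainData)          # fit every stage on the training data
prediction_lr = model_lr.transform(testData)   # apply the fitted pipeline to the test data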
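To make the two feature stages in this pipeline concrete, here is a plain-Python sketch of the idea behind them. It is not Spark's implementation. In Spark ML, StringIndexer maps each category to a numeric index ordered by label frequency (the most frequent label gets index 0), and VectorAssembler concatenates the chosen columns into one feature vector. The helper names, the sample rows, the `_idx` suffix, and the alphabetical tie-breaking rule are assumptions made for this illustration.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first.

    Ties are broken alphabetically here; that rule is an assumption of
    this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def assemble(row, columns):
    """Concatenate the named columns of a row into one feature list."""
    return [float(row[c]) for c in columns]


# Hypothetical rows; the categorical column names come from the pipeline code.
rows = [
    {"RC_GENDER": "F", "ONLINE_FLAG": "N", "AGE": 19},
    {"RC_GENDER": "M", "ONLINE_FLAG": "N", "AGE": 22},
    {"RC_GENDER": "F", "ONLINE_FLAG": "Y", "AGE": 20},
]

# Fit one index per categorical column, then add "<col>_idx" columns.
categorical = ["RC_GENDER", "ONLINE_FLAG"]
indexes = {c: fit_string_index([r[c] for r in rows]) for c in categorical}
indexed = [
    {**r, **{c + "_idx": indexes[c][r[c]] for c in categorical}} for r in rows
]

features = [assemble(r, ["RC_GENDER_idx", "ONLINE_FLAG_idx", "AGE"]) for r in indexed]
print(features)
```

The classifier at the end of the pipeline then sees only the assembled numeric vector, which is why the indexers and the assembler come before `lr` in the stage list.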
Logistic Regression Results
Spark (test data count: 5145, 19 features):
             Predicted 0   Predicted 1
  Actual 0   4065          720
  Actual 1   51            309
  309 Students at Risk, 85.01 % Accuracy, 85.83 % Recall, Time: 20 seconds
Weka (test data count: 5145, 19 features):
             Predicted 0   Predicted 1
  Actual 0   4093          692
  Actual 1   49            311
  311 Students at Risk, 85.6 % Accuracy, 86.4 % Recall, Time: 49 seconds
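The accuracy and recall figures on this slide follow from the two confusion matrices, read with rows as actual labels and columns as predicted labels, and class 1 meaning "at risk". The sketch below recomputes them; the function name `accuracy_and_recall` is an assumption for illustration, and the matrices are copied from the slide.

```python
def accuracy_and_recall(matrix):
    """Return (accuracy, recall) for a 2x2 confusion matrix.

    matrix[actual][predicted], with class 1 = student at risk.
    """
    tn, fp = matrix[0]
    fn, tp = matrix[1]
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total   # share of all students classified correctly
    recall = tp / (fn + tp)        # share of at-risk students that were caught
    return accuracy, recall


spark_lr = [[4065, 720], [51, 309]]
weka_lr = [[4093, 692], [49, 311]]

for name, m in (("Spark", spark_lr), ("Weka", weka_lr)):
    acc, rec = accuracy_and_recall(m)
    print(f"{name}: accuracy={acc:.2%} recall={rec:.2%}")
```

Recall is the more telling number for this use case, since a missed at-risk student (a false negative) gets no early feedback at all.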
Random Forest Comparison
Spark (data count: 5145, 19 features):
             Predicted 0   Predicted 1
  Actual 0   4065          720
  Actual 1   51            309
  309 Students at Risk, 85.01 % Accuracy, 85.83 % Recall, Time: 16 seconds
Weka (data count: 5145, 19 features):
             Predicted 0   Predicted 1
  Actual 0   4186          599
  Actual 1   83            277
  277 Students at Risk, 86.7 % Accuracy, 76.9 % Recall, Time: 30 seconds
Naive Bayes Comparison
Spark (data count: 5145, 19 features):
             Predicted 0   Predicted 1
  Actual 0   4279          506
  Actual 1   158           202
  202 Students at Risk, 87.1 % Accuracy, 56.1 % Recall, Time: 9 seconds
Weka (data count: 5145, 19 features):
             Predicted 0   Predicted 1
  Actual 0   4093          692
  Actual 1   67            293
  293 Students at Risk, 85.2 % Accuracy, 81.4 % Recall, Time: 30 seconds
Why is this Better?
[Workflow diagram]
• Training path: Data → Training → Sampling → Imputation → Train_Data → Fit → Model
• Test path: Data → Test → Imputation → Test_Data → Transform (using Model) → Predictions
• Complete workflow in one environment
  • Zeppelin on Spark
• Java/Scala or Python to choose from
• Distributed computing
Spark Challenges
• No Python support to save and load pipeline model yet
  • SPARK-6725, SPARK-13032
• ML StringIndexer does not protect itself from column name duplication
  • SPARK-12874
• PySpark CrossValidatorModel does not support avgMetrics
  • SPARK-12810
  • You have to create an RDD and then extract the metrics
• PMML export not supported yet
  • SPARK-11171
Q&A
LOGISTIC REGRESSION MODEL
Random Forest Code
Naïve Bayes Code
Appendix
IBM Open Platform for Apache Hadoop (IOP)
• Includes Spark
• 100% Open Source
• Implement with help from IBM Lab Services
• Production Support Offering Available
Apache Open Source Components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene
[Diagram: IBM Open Platform with Apache Hadoop]
Questions??

Apache Spark Use case for Education Industry

  • 1. © 2016 IBM Corporation1 Academic Alert System Presenter: Vinayak Agrawal Vagrawal@us.ibm.com
  • 2. © 2016 IBM Corporation2 Agenda  Use Case  Use Case Architecture/Work Flow in Weka  Data Volume  Problem Statement  Our Analytical Platform  Spark Workflow  Result Comparison between Weka and Spark  Spark Challenges  Q&A
  • 3. © 2016 IBM Corporation3 Use Case: Academic Alert System  Academic Institutions get performance based funding on parameters* like  Student Retention – Retention Rates  Student Graduating – Completion Rates  Academic Institutions wants to be proactive in providing academic feedback to students BEFORE they appear in final exam. *Source: http:///www.ncsl.org/research/education/performance-funding.aspx Develop a ML model which has the capability to predict at-risk (who might fail) students and provide this feedback to students and Professors so that they can take appropriate actions
  • 4. © 2016 IBM Corporation4 Use Case: Academic Alert System in Weka
  • 5. © 2016 IBM Corporation5 Data Volume (in Prod) Learning Management Systems 1) Student Activity data Total = ~ 350 million records Research = 15-18 million records 2) Student Gradebook data Total = ~ 1.5 million Research = 100,000 per semester Student Information systems 1) Demographics Research = 5500 students per semester x 3 2) Enrollment Research = 27000 per semester x 3 3) Course Research = ~2000 per semester x 3
  • 6. Problem Statement. Small universities have fewer students, so Weka might work. Weka limitation: it may be impossible to train models from large datasets using the Weka Explorer graphical user interface, even when the Java heap size has already been increased, because the Explorer always loads the entire dataset into the computer's main memory. To scale out for larger universities: how do I process 45,000 students with 20 features?
  • 7. Analytical Platform. Hardware: 3 virtual machines on IBM PureFlex (8 cores per VM; 32 GB RAM, 100 GB per VM). Software: 3-node Hadoop cluster (Spark 1.5.2: Zeppelin, Python, Scala; Oozie, Hive and Sqoop).
  • 8. Spark Workflow. [Workflow diagram: Data is split into Training and Test; the training branch (Sampling, Imputation, Train_Data) feeds Fit to build the Model; the test branch (Imputation, Test_Data) feeds Transform to produce Predictions.]
  • 9. What does our data look like? Data sources: derived from the ETL stage. 19 features from the Learning Management System and student demographics. Count: training = 9923; testing = 5145.
  • 10. Sampling (TRAINING DATA). Label counts before sampling: 0.0 = 9267, 1.0 = 656. Label counts after sampling: 0.0 = 9267, 1.0 = 9184. (1.0 = student at risk.) The training data was skewed, with only 656 at-risk students, so we duplicated the at-risk rows.
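The duplication step on the Sampling slide can be sketched in plain Python. This is only an illustration, not the presenter's Spark code: the helper name `oversample_minority` and the synthetic rows are assumptions, and the factor of 14 is inferred from the slide's counts (9184 / 656 = 14).

```python
from collections import Counter

def oversample_minority(rows, label_key, minority_label, factor):
    """Return rows in which each minority-class row appears `factor` times."""
    balanced = []
    for row in rows:
        copies = factor if row[label_key] == minority_label else 1
        balanced.extend([row] * copies)
    return balanced

# Synthetic stand-in for the training set: only the label counts match the slide.
train = [{"label": 0.0}] * 9267 + [{"label": 1.0}] * 656

# 9184 / 656 = 14, so each at-risk row appears 14 times after duplication.
balanced = oversample_minority(train, "label", 1.0, factor=14)
counts = Counter(row["label"] for row in balanced)
print(counts[0.0], counts[1.0])
```

Duplicating rows only in the training split, as the slide does, keeps copies of the same student out of the test data.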
  • 11. Imputation. Filling with the mean value for numerical columns (Age, SAT scores); filling with the mode value for categorical columns (Enrollment Status).
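A minimal plain-Python sketch of mean/mode imputation follows. The function names, column names and sample values are hypothetical, and computing the fill values on the training rows and reusing them on the test rows is an assumption about how the two imputation steps relate, not something the slides state.

```python
from collections import Counter
from statistics import mean

def fit_fill_values(rows, numeric_cols, categorical_cols):
    """Compute one fill value per column: mean for numeric, mode for categorical."""
    fills = {}
    for col in numeric_cols:
        fills[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        observed = Counter(r[col] for r in rows if r[col] is not None)
        fills[col] = observed.most_common(1)[0][0]
    return fills

def apply_fill_values(rows, fills):
    """Return copies of the rows with None replaced by the column's fill value."""
    return [
        {col: fills.get(col, val) if val is None else val for col, val in r.items()}
        for r in rows
    ]

# Hypothetical training rows with missing values marked as None.
train_rows = [
    {"age": 20, "sat": 1100, "enrollment_status": "full-time"},
    {"age": None, "sat": 1300, "enrollment_status": "full-time"},
    {"age": 24, "sat": None, "enrollment_status": "part-time"},
    {"age": 22, "sat": 1200, "enrollment_status": None},
]
test_rows = [{"age": None, "sat": None, "enrollment_status": None}]

fills = fit_fill_values(train_rows, ["age", "sat"], ["enrollment_status"])
train_filled = apply_fill_values(train_rows, fills)
test_filled = apply_fill_values(test_rows, fills)
print(test_filled)
```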
  • 12. Modelling Using Spark ML Package. Why? DataFrame; Build the Pipeline Model; String Indexer for Categorical Variables; Vector Assembler; Use Model. 4 lines of code:
      from pyspark.ml import Pipeline
      from pyspark.ml.classification import LogisticRegression

      # The *_indexer stages and the assembler must be defined beforehand (not shown on the slide).
      lr = LogisticRegression(maxIter=100, regParam=0.01)
      pipeline_lr = Pipeline(stages=[ONLINE_FLAG_indexer, RC_GENDER_indexer, RC_ENROLLMENT_STATUS_indexer,
                                     RC_CLASS_CODE_indexer, STANDING_indexer, assembler, lr])
      model_lr = pipeline_lr.fit(trainData)         # fit every stage on the training data
      prediction_lr = model_lr.transform(testData)  # add prediction columns to the test data
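To show what the string-indexing and vector-assembly stages contribute before the classifier runs, here is a plain-Python illustration. It is not Spark's implementation: `fit_string_index` and `assemble` are hypothetical helpers, the row values and the `AGE` column are made up, and only the column names `RC_GENDER` and `ONLINE_FLAG` come from the slide. It mirrors the default behaviour of Spark's StringIndexer, which gives the most frequent label index 0.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a float index, most frequent value first."""
    counts = Counter(values)
    # Descending frequency, then alphabetical so that ties are deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def assemble(row, feature_cols):
    """Concatenate the chosen columns of one row into a single feature list."""
    return [float(row[col]) for col in feature_cols]

# Hypothetical rows; the values are invented for illustration.
rows = [
    {"RC_GENDER": "F", "ONLINE_FLAG": "N", "AGE": 19},
    {"RC_GENDER": "M", "ONLINE_FLAG": "N", "AGE": 22},
    {"RC_GENDER": "F", "ONLINE_FLAG": "Y", "AGE": 31},
]

categorical = ["RC_GENDER", "ONLINE_FLAG"]
indexes = {col: fit_string_index(r[col] for r in rows) for col in categorical}

features = []
for r in rows:
    indexed = dict(r)
    for col in categorical:
        indexed[col + "_index"] = indexes[col][r[col]]
    features.append(assemble(indexed, ["RC_GENDER_index", "ONLINE_FLAG_index", "AGE"]))

print(features)
```

Each row ends up as one numeric vector, which is the form a classifier such as logistic regression expects as its features input.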
  • 13. Logistic Regression Results
      Spark (test data count: 5145; 19 features)
                    Predicted 0   Predicted 1
        Actual 0       4065           720
        Actual 1         51           309
        309 students at risk; 85.01% accuracy; 85.83% recall; time: 20 seconds
      Weka (test data count: 5145; 19 features)
                    Predicted 0   Predicted 1
        Actual 0       4093           692
        Actual 1         49           311
        311 students at risk; 85.6% accuracy; 86.4% recall; time: 49 seconds
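The accuracy and recall figures on this slide follow directly from the confusion matrices. A short sketch of the arithmetic, using a hypothetical helper `accuracy_and_recall` and treating label 1 (at risk) as the positive class:

```python
def accuracy_and_recall(tn, fp, fn, tp):
    """Return (accuracy, recall) for a 2x2 confusion matrix, positive class = 1."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    recall = tp / (tp + fn)
    return accuracy, recall

# Counts copied from the logistic regression slide (rows = actual, columns = predicted).
spark_acc, spark_recall = accuracy_and_recall(tn=4065, fp=720, fn=51, tp=309)
weka_acc, weka_recall = accuracy_and_recall(tn=4093, fp=692, fn=49, tp=311)

print(f"Spark: accuracy {spark_acc:.2%}, recall {spark_recall:.2%}")
print(f"Weka:  accuracy {weka_acc:.2%}, recall {weka_recall:.2%}")
```

Recall is the share of the 360 actually at-risk students in the test set that the model flags, which is why the slides report it alongside accuracy for such a skewed label.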
  • 14. Random Forest Comparison
      Spark (data count: 5145; 19 features)
                    Predicted 0   Predicted 1
        Actual 0       4065           720
        Actual 1         51           309
        309 students at risk; 85.01% accuracy; 85.83% recall; time: 16 seconds
      Weka (data count: 5145; 19 features)
                    Predicted 0   Predicted 1
        Actual 0       4186           599
        Actual 1         83           277
        277 students at risk; 86.7% accuracy; 76.9% recall; time: 30 seconds
  • 15. Naive Bayes Comparison
      Spark (data count: 5145; 19 features)
                    Predicted 0   Predicted 1
        Actual 0       4279           506
        Actual 1        158           202
        202 students at risk; 87.1% accuracy; 56.1% recall; time: 9 seconds
      Weka (data count: 5145; 19 features)
                    Predicted 0   Predicted 1
        Actual 0       4093           692
        Actual 1         67           293
        293 students at risk; 85.2% accuracy; 81.4% recall; time: 30 seconds
  • 16. Why is this Better? [Workflow diagram repeated from the Spark Workflow slide.] Complete workflow in one environment: Zeppelin on Spark; Java/Scala or Python to choose from; distributed computing.
  • 17. Spark Challenges. No Python support to save and load pipeline models yet (SPARK-6725, SPARK-13032); ML StringIndexer does not protect itself from column name duplication (SPARK-12874); PySpark CrossValidatorModel does not support avgMetrics (SPARK-12810), so you have to create an RDD and then extract the metrics; PMML export not supported yet (SPARK-11171).
  • 18. Q&A
  • 19. LOGISTIC REGRESSION MODEL
  • 20. Random Forest Code
  • 21. Naïve Bayes Code
  • 22. Appendix
  • 23. IBM Open Platform for Apache Hadoop (IOP): includes Spark; 100% open source; implement with help from IBM Lab Services; production support offering available. Apache open source components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene. (Diagram label: IBM Open Platform with Apache Hadoop.)
  • 24. Questions?