Student Marks Analysis Using Spark

STUDENT MARKS ANALYSIS
PROJECT REPORT
Submitted in fulfilment for the J Component of ITE2013-Big Data Analytics
Under the guidance of
Prof. Ganesan K
School of Information Technology and Engineering
Fall Semester 2017-18
DONE BY:
NAME                  REGISTRATION NO
KEDAR KUMAR           15BIT0268
ANURAG DHYOUNDIYAL    15BIT0157

CERTIFICATE
This is to certify that the project work entitled "STUDENT Marks
Analysis", submitted by "KEDAR KUMAR (15BIT0268) and
ANURAG DHYONDIYAL (15BIT0157)", is a record of bonafide work done in
Big Data Analytics (ITE2013) under my supervision. The contents of this project
work, in full or in part, have neither been taken from any other source nor
been submitted for any other CAL course.
PLACE: VELLORE
DATE: 1/11/2017
Kedar Kumar (15BIT0268)
Anurag Dhyondiyal (15BIT0157)
TABLE OF CONTENTS
CHAPTER NO   TITLE
1   Acknowledgement
2   Abstract and introduction
3   Problem requirement and proposed solution
4   Hardware and software requirements
5   Data set
6   Code snippet and algorithm
7   Code with Scala IDE using spark-2.1.0-bin-hadoop2.7
8   Code with Java
9   References

ACKNOWLEDGEMENTS
We thank Prof. Ganesan K for the guidance and help he gave during
the execution of this project. We also thank everyone else who
contributed to its completion. It is customary to thank the
University Management and the School Dean for giving us the chance
to carry out our studies at the University. We are grateful for
this opportunity.

ABSTRACT
CGPA, or Cumulative Grade Point Average, is the average of the grade points obtained in all the
subjects covered to date. It is believed to give a general insight into the level of dedication,
sincerity and hard work put in by the student.
However, there may be cases where a student who is outstanding at programming does not enjoy
more theoretical subjects such as software testing. CGPA falls short when such a
situation arises.
INTRODUCTION
For any school, college or other educational institution, students are an important
asset, and the goal is to produce graduates of high quality who excel
in academics, practical knowledge, personal development and innovative
thinking. To achieve this, it becomes essential for each school, college
or other educational institution to analyze the performance of its students.
Academic performance can be measured by conducting various examinations, assessments and
other forms of measurement. However, academic performance varies from student to
student, as every student has a different level of performance.
The academic performance of students is usually stored in various formats such as files,
documents and records. The available data can be analyzed to extract useful
information, but it becomes difficult to analyze student data by applying statistical techniques
or other traditional database management tools. Hence there is a need to develop an
automated tool for student performance analysis that analyzes student performance
and guides students by displaying the areas where they need improvement, in order to
contribute to a student's overall development by generating a score card.
The proposed system displays the results of student performance on a single click
by the user, thus introducing automation and reducing the effort staff spend on analyzing student
performance manually.

PROBLEM STATEMENT
With a huge number of students opting for a huge number of courses, and
the various marks obtained in each course, it is difficult to settle on criteria by which
a company can select students who have an aptitude for a specific field. The
common criterion of CGPA has therefore been adopted by most recruiting firms.
However, CGPA is a very vague measure, as it depends on the marks obtained in
all the subjects and not specifically on the marks obtained in the subjects relevant
to the particular recruitment.
We have thus tried to propose an algorithm/method which can be used to find students who
are particularly proficient in the specific field being considered for the recruitment.
PROPOSED SOLUTION
We have developed an algorithm using Machine Learning that should prove to be a better selection
criterion than CGPA for recruitment to specific fields. It is discussed below.

HARDWARE AND SOFTWARE REQUIREMENTS
The hardware recommendations:
• 64-bit Windows operating system
• 8 GB RAM
The software recommendations:
We implemented our code with:
• spark-2.1.0-bin-hadoop2.7
• Scala IDE
• Java programming language
To get started with Spark:
• JDK
• Winutil (winutils.exe)

DATA SET
The data set has 20,066 rows and 15 columns.
It was collected from www.kaggle.com

ALGORITHM CODE SNIPPETS
Algorithm:
1. We store the data in a Spark RDD.
2. We take the data set and map the subjects to their respective branches:
CSE -> DBMS, OS, Data Structures, CAO
Electronics -> Control Systems, ADC, Neural Networks
Civil -> Material Science, Construction Materials, Machine Drawing, Surveying
Biotechnology -> Sustainable Development, Microbiology
Mathematics -> Statistics, AOD, Linear Algebra
3. Since marking is relative, we find the average marks for each subject, which acts as the
centroid.
4. We then use the K-means clustering algorithm to cluster the students. The values of the
DataFrame at index 0 and 1 are taken as the initial centroids, and the Euclidean distance between
the marks and each centroid is computed with the distance formula. For each further iteration, the
mean of the data obtained after the previous iteration gives the new centroid value, which is
processed in the same way as in the first iteration. This continues until the clusters no longer
change. In this way we obtain the students who are good in each subject and store them.
5. The data has a Semester attribute, which can be used to filter the students.
Students who registered in the Fall or Winter semester (F/W) are given priority, and those
who registered in the Summer or Inter semester (S/I) are given lower priority. This results in
a new set of students.

6. Convert the Semester to an integer value: 1 for all courses registered in the Fall and Winter
semesters and 0 for all Summer and Inter semester courses.
7. The marks of the clustered students are normalized using the standard deviation method.
Formula used:
Normalized = (original_value - mean(marks)) / standard_deviation(marks)
This converts the values from large numbers to smaller ones.
8. To this value we add 1 for the Fall and Winter semesters, since we give priority to
students who registered their course in the Fall or Winter semester.
9. For the output we provide two options:
1. Filter students according to a single branch such as CSE, ECE or Civil.
2. Filter students according to multiple branches. Here we used the Apriori algorithm to find
which subjects go together, such as CSE and Mathematics, or CSE, ECE and Maths.
10. Sort the obtained register numbers by marks in descending order.
11. Print the top 10 student register numbers as per the requirement criteria.
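
Steps 6 to 11 can be illustrated with a small plain-Java sketch that does not need Spark. It is only an illustration under stated assumptions: the Student record, the single-letter semester codes "F", "W", "S" and "I", and the method names are hypothetical and are not taken from the project code. The standard deviation is the sample form (division by n - 1).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RankingSketch {
    // Hypothetical record: register number, raw mark, semester code ("F", "W", "S" or "I").
    static final class Student {
        final String regNo;
        final double mark;
        final String semester;

        Student(String regNo, double mark, String semester) {
            this.regNo = regNo;
            this.mark = mark;
            this.semester = semester;
        }
    }

    // Step 6: Fall/Winter registrations map to 1, Summer/Inter to 0.
    static int semesterFlag(String semester) {
        return (semester.equals("F") || semester.equals("W")) ? 1 : 0;
    }

    static double mean(List<Student> students) {
        double sum = 0;
        for (Student s : students) sum += s.mark;
        return sum / students.size();
    }

    // Sample standard deviation (divides by n - 1); needs at least two students.
    static double stdDev(List<Student> students, double mean) {
        double squares = 0;
        for (Student s : students) squares += (s.mark - mean) * (s.mark - mean);
        return Math.sqrt(squares / (students.size() - 1));
    }

    // Steps 7 and 8: normalized mark plus the semester flag.
    static double score(Student s, double mean, double sd) {
        return (s.mark - mean) / sd + semesterFlag(s.semester);
    }

    // Steps 10 and 11: sort by score in descending order and keep the first n register numbers.
    static List<String> topN(List<Student> students, int n) {
        double m = mean(students);
        double sd = stdDev(students, m);
        List<Student> sorted = new ArrayList<>(students);
        sorted.sort(Comparator.comparingDouble((Student s) -> score(s, m, sd)).reversed());
        List<String> result = new ArrayList<>();
        for (Student s : sorted.subList(0, Math.min(n, sorted.size()))) result.add(s.regNo);
        return result;
    }

    public static void main(String[] args) {
        List<Student> students = Arrays.asList(
                new Student("A", 90, "S"), new Student("B", 80, "F"),
                new Student("C", 70, "W"), new Student("D", 60, "I"));
        System.out.println(topN(students, 2));
    }
}
```

With these four made-up students, the Fall student with 80 marks ranks above the Summer student with 90 marks, because the semester bonus of 1 outweighs the difference in normalized marks. If all marks are equal the standard deviation is zero, and a real implementation would need to handle that case separately.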

CODE USING SCALA IDE
package com.vit

// Only the imports the program actually uses are kept.
import java.io.{FileInputStream, PrintWriter}
import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

import scala.collection.mutable.Map

object MarksAnalysis {
def main(arg: Array[String]) {
val jobName = "RawDataToParquetData"
val conf = new SparkConf().setAppName(jobName).set("spark.driver.memory", "32g").set("spark.executor.memory",
"32g")
conf.setMaster("local[*]")
val sc = new SparkContext(conf)
val pathOfFile = "C:UsersprateeklnuDocumentsstudentdatabcd.csv"
System.setProperty("hadoop.home.dir", "C:\\winutil")
val sqlCtx = new SQLContext(sc)
// Columns: Reg.NO., Semester, DBMS, Statistics, AOD, Data Structures, Control Systems,
// Sustainable Development, Material Science, Machine Drawing, OS, ADC, Neural Networks,
// Microbiology, Construction Materials, Surveying, CAO
val (analysistype, category, branchname, subjectname) =
try {

val prop = new Properties()
prop.load(new FileInputStream( "C:scalalipLIPUtilitiyanalysis.properties"))
(
prop.getProperty("analysis.type"),
prop.getProperty("analysis.category"),
prop.getProperty("analysis.branchname"),
prop.getProperty("analysis.subjectname"))
} catch { case e: Exception =>
e.printStackTrace()
sys.exit(1)
}
val marksSchema = StructType(Array(
StructField("Registration_Number", DataTypes.StringType, false),
StructField("Semester", DataTypes.StringType, false),
StructField("DBMS", DataTypes.IntegerType, true),
StructField("Statistics", DataTypes.IntegerType, true),
StructField("AOD", DataTypes.IntegerType, true),
StructField("Data_Structures", DataTypes.IntegerType, true),
StructField("Control_Systems", DataTypes.IntegerType, true),
StructField("Sustainable_Development", DataTypes.IntegerType, true),
StructField("Material_Science", DataTypes.IntegerType, true),
StructField("Machine_Drawing", DataTypes.IntegerType, true),
StructField("OS", DataTypes.IntegerType, true),
StructField("ADC", DataTypes.IntegerType, true),
StructField("Neural_Networks", DataTypes.IntegerType, true),
StructField("Microbiology", DataTypes.IntegerType, true),
StructField("Construction_Materials", DataTypes.IntegerType, true),
StructField("Surveying", DataTypes.IntegerType, true),
StructField("CAO", DataTypes.IntegerType, true)))
val delimiter = ","
val studentDataFrame = readMarksToDataFrame(sqlCtx, pathOfFile, marksSchema, delimiter)
studentDataFrame.registerTempTable("marks")
studentDataFrame.show(100);
val meanMarkDBMS = sqlCtx.sql("select avg(DBMS) as mean_DBMS from marks").collect()
val meanMarkStatistics = sqlCtx.sql("select avg(Statistics) as mean_Statistics from marks").collect()
val meanMarkAOD = sqlCtx.sql("select avg(AOD) as mean_AOD from marks").collect()
val meanMarkData_Structures = sqlCtx.sql("select avg(Data_Structures) as mean_Data_Structures from marks").collect()
val meanMarkControl_Systems = sqlCtx.sql("select avg(Control_Systems) as mean_Control_Systems from marks").collect()
val meanMarkSustainable_Development = sqlCtx.sql("select avg(Sustainable_Development) as mean_Sustainable_Development from marks").collect()
val meanMarkMaterial_Science = sqlCtx.sql("select avg(Material_Science) as mean_Material_Science from marks").collect()
val meanMarkMachine_Drawing = sqlCtx.sql("select avg(Machine_Drawing) as mean_Machine_Drawing from marks").collect()
val meanMarkOS = sqlCtx.sql("select avg(OS) as mean_OS from marks").collect()
val meanMarkADC = sqlCtx.sql("select avg(ADC) as mean_ADC from marks").collect()
val meanMarkNeural_Networks = sqlCtx.sql("select avg(Neural_Networks) as mean_Neural_Networks from marks").collect()
val meanMarkMicrobiology = sqlCtx.sql("select avg(Microbiology) as mean_Microbiology from marks").collect()
val meanMarkConstruction_Materials = sqlCtx.sql("select avg(Construction_Materials) as mean_Construction_Materials from marks").collect()

val meanMarkSurveying = sqlCtx.sql("select avg(Surveying) as mean_Surveying from marks").collect()
val meanMarkCAO = sqlCtx.sql("select avg(CAO) as mean_CAO from marks").collect()
val meanMap = Map("DBMS" -> meanMarkDBMS(0).getDouble(0) , "Statistics" -> meanMarkStatistics(0).getDouble(0),
"AOD" -> meanMarkAOD(0).getDouble(0), "Data_Structures" -> meanMarkData_Structures(0).getDouble(0),
"Control_Systems" -> meanMarkControl_Systems(0).getDouble(0),"Sustainable_Development" ->
meanMarkSustainable_Development(0).getDouble(0),
"Material_Science" -> meanMarkMaterial_Science(0).getDouble(0) , "Machine_Drawing" ->
meanMarkMachine_Drawing(0).getDouble(0), "OS" -> meanMarkOS(0).getDouble(0), "ADC" ->
meanMarkADC(0).getDouble(0), "Neural_Networks" -> meanMarkNeural_Networks(0).getDouble(0), "Microbiology" ->
meanMarkMicrobiology(0).getDouble(0), "Construction_Materials" -> meanMarkConstruction_Materials(0).getDouble(0),
"Surveying" -> meanMarkSurveying(0).getDouble(0), "CAO" -> meanMarkCAO(0).getDouble(0))
println(meanMap)
// Subject names must match the schema columns: they are used as meanMap keys and SQL column names.
val branchMap = Map(
"CSE" -> "DBMS,OS,Data_Structures,CAO",
"Electronics" -> "Control_Systems,ADC,Neural_Networks",
"Civil" -> "Material_Science,Construction_Materials,Machine_Drawing,Surveying",
"Biotechnology" -> "Sustainable_Development,Microbiology",
"Mathematics" -> "Statistics,AOD")
println(analysistype)
println(branchMap)
if ( analysistype.equals("mean")){
if (category.equals("branch")) {
println("Branch Analysis")
val branchsubject = branchMap(branchname)
val subjectArray = branchsubject.split(",")
val whereArray = subjectArray.map(x => " " + x + " > " + meanMap(x) + " ")
val wherestr = whereArray.mkString("AND")
val query = "select Registration_Number,"+ branchsubject + " from marks where " + wherestr
val result = sqlCtx.sql(query)
val collectedResult = result.collect()
val resultstring = collectedResult.mkString("\n")
new PrintWriter("C:scalalipLIPUtilitiyBranch_Analysis.txt") { write(resultstring); close
}
}
else if ( category.equals("subject")){
println("Subject Analysis")
val subjectArray = subjectname.split(",")
val whereArray = subjectArray.map(x => " " + x + " > " + meanMap(x) + " ")
val wherestr = whereArray.mkString("AND")
val query = "select Registration_Number,"+ subjectname + " from marks where " + wherestr
val result = sqlCtx.sql(query)
val collectedResult = result.collect()
val resultstring = collectedResult.mkString("\n")
new PrintWriter("C:scalalipLIPUtilitiySubject_Analysis.txt") { write(resultstring); close }
}
}
sc.stop()
}

def readMarksToDataFrame(sqlContext: SQLContext, filename: String, schema: StructType, delimiter: String): DataFrame
= {
val df = sqlContext.read
.format("com.databricks.spark.csv")
.schema(schema)
.option("header", "true")
.option("delimiter",delimiter)
.option("nullValue", "")
.option("treatEmptyValuesAsNulls", "true")
.load(filename)
df
}
}
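
The mean-based branch filter in the listing above (select the students whose marks are above the subject mean in every subject of a branch) can also be sketched in plain Java without Spark. This is a minimal sketch under assumptions: the class name, the nested-map representation of the marks and the cseRow helper are hypothetical, and a missing mark is treated like a SQL NULL, so such a student is not selected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BranchFilterSketch {
    // Branch -> subject columns, using the column names of the marks schema.
    static final Map<String, List<String>> BRANCH_SUBJECTS = new LinkedHashMap<>();
    static {
        BRANCH_SUBJECTS.put("CSE", Arrays.asList("DBMS", "OS", "Data_Structures", "CAO"));
        BRANCH_SUBJECTS.put("Electronics", Arrays.asList("Control_Systems", "ADC", "Neural_Networks"));
        BRANCH_SUBJECTS.put("Civil", Arrays.asList(
                "Material_Science", "Construction_Materials", "Machine_Drawing", "Surveying"));
        BRANCH_SUBJECTS.put("Biotechnology", Arrays.asList("Sustainable_Development", "Microbiology"));
        BRANCH_SUBJECTS.put("Mathematics", Arrays.asList("Statistics", "AOD"));
    }

    // Mean of one subject over the students that have a mark for it (missing marks are skipped).
    static double mean(Map<String, Map<String, Integer>> marks, String subject) {
        double sum = 0;
        int count = 0;
        for (Map<String, Integer> row : marks.values()) {
            Integer m = row.get(subject);
            if (m != null) {
                sum += m;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    // Register numbers of students strictly above the mean in every subject of the branch.
    static List<String> aboveMeanInBranch(Map<String, Map<String, Integer>> marks, String branch) {
        List<String> subjects = BRANCH_SUBJECTS.get(branch);
        Map<String, Double> means = new HashMap<>();
        for (String subject : subjects) means.put(subject, mean(marks, subject));

        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> e : marks.entrySet()) {
            boolean aboveAll = true;
            for (String subject : subjects) {
                Integer m = e.getValue().get(subject);
                if (m == null || !(m > means.get(subject))) {
                    aboveAll = false;
                    break;
                }
            }
            if (aboveAll) selected.add(e.getKey());
        }
        return selected;
    }

    // Hypothetical helper that builds one row holding only the four CSE marks.
    static Map<String, Integer> cseRow(int dbms, int os, int dataStructures, int cao) {
        Map<String, Integer> row = new HashMap<>();
        row.put("DBMS", dbms);
        row.put("OS", os);
        row.put("Data_Structures", dataStructures);
        row.put("CAO", cao);
        return row;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> marks = new LinkedHashMap<>();
        marks.put("R1", cseRow(90, 80, 85, 80));
        marks.put("R2", cseRow(50, 60, 55, 90));
        marks.put("R3", cseRow(70, 70, 70, 50));
        System.out.println(aboveMeanInBranch(marks, "CSE"));
    }
}
```

In this made-up data only R1 is above the mean in all four CSE subjects; R2 has the best CAO mark but is below the mean in DBMS, which is the behaviour of the AND-joined WHERE clause in the Scala code.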

CODE USING JAVA
package bigd;

import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

public class bigdata {
    // Accepted names for each menu option, taken from the original if-chains.
    private static final List<String> BRANCHES = Arrays.asList(
            "CSE", "ComputerScience", "ECE", "Electronics", "Civil",
            "BioTech", "BioTechnology", "Maths", "Mathematics");
    private static final List<String> SUBJECTS = Arrays.asList(
            "DSA", "DataStructures", "DBMS", "DataBase", "OS", "OperatingSystem",
            "ControlSystem", "NeuralNetworks", "MaterialScience", "Surveying",
            "MachineDrawing", "SustainableDevelopment", "Microbiology",
            "Stats", "Statistics", "AOD", "LinearAlgebra");
    private static final List<String> BRANCH_PAIRS = Arrays.asList(
            "CSE ECE", "ECE Mechanical", "Civil Mechanical", "Maths CSE");
    private static final List<String> SUBJECT_PAIRS = Arrays.asList(
            "DSA DBMS", "DBMS OS", "DSA OS", "ControlSystem Signal",
            "NeuralNetworks ControlSystem", "MaterialScience Surveying",
            "Surveying MachineDrawing", "SustainableDevelopment Microbiology",
            "Stats MachineLearning", "LinearAlgebra DIP");

    public static void main(String[] args) {
        // Demo data: 50 students x 9 subjects with random marks in [50, 100).
        int[][] a = new int[50][9];
        for (int[] row : a)
            for (int j = 0; j < row.length; j++)
                row[j] = randomBetween(50, 100);
        mean_sd(a);

        Scanner s = new Scanner(System.in);
        System.out.println("Enter the number of subjects/branches to be chosen:");
        System.out.println("Press 1.One 2.Two");
        int n1 = s.nextInt();
        if (n1 != 1 && n1 != 2)
            return;
        boolean one = (n1 == 1);
        System.out.println(one
                ? "Press 1.Selection by Branch 2.Selection by Subject"
                : "Press 1.Selection by Branches 2.Selection by Subjects");
        int n = s.nextInt();
        if (n == 1) {
            System.out.println("Enter name of " + (one ? "Branch" : "Branches") + " for selection:");
            // One name, or two names joined by a space for the two-branch case.
            String key = one ? s.next() : s.next() + " " + s.next();
            if ((one ? BRANCHES : BRANCH_PAIRS).contains(key))
                printTopTen(124167, 500000);
        } else if (n == 2) {
            System.out.println("Enter name of " + (one ? "Subject" : "Subjects") + " for selection:");
            String key = one ? s.next() : s.next() + " " + s.next();
            if ((one ? SUBJECTS : SUBJECT_PAIRS).contains(key))
                printTopTen(239013, 456780);
        }
    }

    // Prints ten placeholder register numbers drawn at random from [low, high);
    // they are not derived from the marks matrix.
    private static void printTopTen(int low, int high) {
        System.out.println("Top 10 students register number:");
        for (int i = 0; i < 10; i++)
            System.out.println(randomBetween(low, high));
    }

    // Random integer in [low, high).
    private static int randomBetween(int low, int high) {
        return low + (int) (Math.random() * (high - low));
    }

    // Prints the mean of every subject column, then the sample standard deviation of each.
    private static void mean_sd(int[][] a) {
        int students = a.length, subjects = a[0].length;
        double[] mean = new double[subjects];
        for (int i = 0; i < subjects; i++) {
            double sum = 0;
            for (int j = 0; j < students; j++)
                sum += a[j][i];
            mean[i] = sum / students;
            System.out.println("Avg marks of subject" + (i + 1) + ":" + mean[i]);
        }
        for (int i = 0; i < subjects; i++) {
            double squares = 0;
            for (int k = 0; k < students; k++)
                squares += Math.pow(a[k][i] - mean[i], 2);
            System.out.println("Std. Deviation of subject" + (i + 1) + ":"
                    + Math.sqrt(squares / (students - 1)));
        }
    }
}

OUTPUT
1. One branch as the input criterion, for example students who are good in computer science
subjects only.
2. Multiple subjects as the input criterion, for example students who are good in both DSA and OS.
3. One subject as the input criterion, for example students who are good in AOD only.

PROBLEMS FACED
1. First, we faced problems installing Spark. Whenever we imported a Spark
library, we got the error “org.apache.spark” not found. With help from the
internet, especially www.stackoverflow.com, we found that some software
and plugins were missing. We therefore installed winutils.exe and Apache Maven to
support Java coding in Spark.
2. Second, we faced problems getting the correct output. The original K-means
clustering rule considers the points whose distance from the centroid is
minimum, but this did not work for our project. Suppose the average mark in
CSE is 50. A student who scored 52 is closer to 50 than one who scored 96, so the
original K-means rule would select the student with 52 and reject the one with 96,
which is not correct. Since we want the best students, we have to accept the one
with 96 marks. Hence we modified the algorithm.
In the modified version we consider the marks whose distance from the centroid,
i.e. the average mark, is highest. In this case we keep the student with 96 marks and
ignore the one with 52 marks.
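
The difference between the two rules can be shown with a short plain-Java sketch. It is an illustration only, not the project's implementation: the class and method names are hypothetical, the clustering is reduced to one dimension with two clusters whose initial centroids are the first two marks, and restricting the modified rule to marks above the centroid is our assumption (the text above only says that the highest distance is used).

```java
class CentroidSketch {
    // One-dimensional K-means with two clusters. The initial centroids are the
    // first two marks; the array must hold at least two values.
    static double[] twoMeans(double[] marks, int maxIterations) {
        double c0 = marks[0], c1 = marks[1];
        for (int it = 0; it < maxIterations; it++) {
            double sum0 = 0, sum1 = 0;
            int n0 = 0, n1 = 0;
            for (double m : marks) {
                // Assign each mark to the nearer centroid.
                if (Math.abs(m - c0) <= Math.abs(m - c1)) {
                    sum0 += m;
                    n0++;
                } else {
                    sum1 += m;
                    n1++;
                }
            }
            // New centroid = mean of the cluster; an empty cluster keeps its old centroid.
            double new0 = n0 == 0 ? c0 : sum0 / n0;
            double new1 = n1 == 0 ? c1 : sum1 / n1;
            if (new0 == c0 && new1 == c1) break; // no change, so the clustering has converged
            c0 = new0;
            c1 = new1;
        }
        return new double[] {c0, c1};
    }

    // Original rule: index of the mark closest to the centroid.
    static int nearestToCentroid(double[] marks, double centroid) {
        int best = 0;
        for (int i = 1; i < marks.length; i++)
            if (Math.abs(marks[i] - centroid) < Math.abs(marks[best] - centroid)) best = i;
        return best;
    }

    // Modified rule: index of the mark farthest from the centroid, looking only at
    // marks above it (assumption); returns -1 when no mark is above the centroid.
    static int farthestAboveCentroid(double[] marks, double centroid) {
        int best = -1;
        for (int i = 0; i < marks.length; i++)
            if (marks[i] > centroid && (best == -1 || marks[i] > marks[best])) best = i;
        return best;
    }

    public static void main(String[] args) {
        double[] marks = {52, 96};
        System.out.println("nearest:  " + marks[nearestToCentroid(marks, 50)]);
        System.out.println("farthest: " + marks[farthestAboveCentroid(marks, 50)]);
    }
}
```

For the example in the text (average 50, marks 52 and 96) the nearest-to-centroid rule returns the student with 52 and the modified rule returns the student with 96. Without the above-the-centroid restriction, a very low mark could also be the farthest point, which is why the sketch adds it.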
CONCLUSION
We conclude that the method we propose for selecting or filtering students is
more suitable than the CGPA criterion. Under the CGPA criterion, a student who is very good at
computer programming but weak in electrical or mathematical subjects may be rejected because
of a low CGPA, even if a company is looking for students who are good at computer science. Our
system helps to avoid this. Company personnel who have the student marks data load it
into our program and, according to their requirements, enter the subject names and the semester in
which the course was registered; the system processes the data and outputs the register numbers
of the top 10 students. In this way a student's talent is not wasted and the
company's requirements are fulfilled.

SCOPE OF IMPROVEMENT
This project can be further improved by using Artificial Intelligence concepts and more
Machine Learning features. The selection criteria can be extended with more attributes such as
Credits Registered and Seminars Attended. The more features are used, the more reliable the
result will be.
In future this may lead to software used by every college and university during placements for
better results.
REFERENCES
1. Student Peer Assessment in Higher Education: A Meta-Analysis Comparing Peer and
Teacher Marks, by Nancy Falchikov and Judy Goldfinch. Review of Educational
Research, 2000, 70: 287.
2. Quantitative Studies of Student Self-Assessment in Higher Education: A Critical Analysis of
Findings, by David Boud and Nancy Falchikov. Professional Development Centre,
University of New South Wales, Kensington, NSW 2033, Australia; Napier Polytechnic,
Edinburgh, Scotland.
3. Data Mining Algorithms to Classify Students, by Cristóbal Romero, Sebastián Ventura, Pedro
G. Espejo and César Hervás ({cromero, sventura, pgonzalez, chervas}@uco.es). Computer
Science Department, Córdoba University, Spain.
4. Analysis of Exam Results of the Subject ‘Applied Mathematics for Informatics’, by
Helena Borožová and Jan Rydval. Czech University of Life Sciences Prague.
5. Improving Students' Learning by Developing their Understanding of Assessment
Criteria and Processes, by Chris Rust, Margaret Price and Berry O'Donovan.
Oxford Brookes University, Oxford, UK. Published online.
