Brazil's University Ranking: a Prediction Study with
Machine Learning
Sérgio Nicolau da Silva
Departamento Sistemas
Instituto Federal de Educação, Ciência e Tecnologia de Santa Catarina -
IFSC
Rua 15 de Julho, 150 - Coqueiros. Florianópolis, SC. Brazil. CEP
88070-010
Cleverson Tabajara Vianna *
Departamento de Saúde e Serviço
IFSC
Av. Mauro Ramos, 950 - Centro. Florianópolis, SC. Brazil. CEP 88020-300.
Fernando Alvaro Ostuni Gauthier
Departamento de Engenharia e Gestão do Conhecimento
Universidade Federal de Santa Catarina - UFSC
Campus Reitor João David Ferreira Lima, s/n - Trindade, Florianópolis
- SC. Brazil, CEP 88040-900
Antônio Pereira Cândido
Departamento de Saúde e Serviço
IFSC
Av. Mauro Ramos, 950 - Centro. Florianópolis, SC. Brazil. CEP 88020-300
* Corresponding author
Structured Abstract
How can the best and worst institutions of higher education be distinguished? This question
occupies the minds and hearts of parents, students, and teachers, because education
is an investment in both the personal and the national future. The Ranking Universitário da
Folha (RUF) emerged as a source of information to answer it. Known for its
traditional evaluation, the Folha ranking is considered an independent evaluation tool
and provides a ranking of the best Brazilian universities. 74% of its data relate to
research areas and postgraduate programs. The body that regulates and supervises postgraduate
609
Proceedings IFKAD 2018
Delft, Netherlands, 4-6 July 2018
ISBN 978-88-96687-11-6
ISSN 2280787X

programs in Brazil is CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível
Superior), which authorizes each program or not and assigns it a score from 1 to 7, with 7 being the
best score. The data behind this evaluation are published. In this article, we use machine
learning techniques based on the Naïve Bayes algorithm. CAPES data and the Folha
ranking of previous years are used as the training set for the Naïve Bayes
algorithm. After training, CAPES data from 2015 were used to predict the 2016
ranking, with a hit rate of 61.5%. A hit rate above 60% on the Folha ranking shows
that it is possible, with a more detailed study and analysis of the techniques, to predict it
with a certain degree of confidence. It should be noted that, according to the Folha ranking rules,
Scientific Research (mostly postgraduate) corresponds to a weight of 42% in the
ranking.
Purpose – To use machine learning techniques to predict the Ranking Universitário
da Folha (RUF), using the previous year's history to train the Naïve Bayes algorithm.
Design/methodology/approach – Applied research, descriptive and exploratory in its objectives,
with a qualitative and quantitative approach. Data were extracted from the RUF and CAPES and
homogenized, and engineering methods were applied using several tools and techniques (WEKA, KDD,
data mining, PostgreSQL, Pentaho ETL).
Originality/value – Use of machine learning techniques to gauge/predict the quality of
higher education, an index that is embedded in a complex and interdisciplinary context.
Practical implications – Proposes a statistics-based model to determine the quality of an
educational institution.
Keywords – Brazil, University, Ranking, Naïve Bayes, Machine Learning.
Paper Type: Academic Research Paper
1 Introduction
The quality of higher education, its assessment, and the ranking of the best universities are
topics addressed in this article. Relevant as these themes are, the
central point explored here is the use of data mining algorithms to predict such a ranking. It
is nevertheless necessary to contextualize the emergence and importance of these rankings.
The preliminary topic, which highlights the relevance of the theme, is the
quality of higher education.
The quality of education in Brazilian universities has become a central theme for the
country. In the educational area, the term is not consolidated and there is no common
ground on its meaning. Yet, for all practical purposes, the lack of understanding of the concept is not treated as a
problem; the idea of quality is not even put at the focus of discussion.
Together with the theme of quality, questions about quality assurance and accreditation
arise (Sobrinho, 2008).

Since the 1990s, most Latin American countries have set up their own bodies for
evaluating the quality of university education (Sobrinho, 2006). In Brazil,
accreditation, which there ultimately means "operating authorization", is a
governmental assignment regulated by the Sistema Nacional de Avaliação da Educação
Superior (SINAES), the National System for the Evaluation of Higher Education (Sobrinho
and Vessuri, 2006; Rish, 2001).
Since every university in Brazil must hold an accreditation, that is, government
authorization to operate, how can the best be distinguished from the worst? This question
pervades the minds and hearts of parents, students, and teachers, as education is an
investment in the personal future and in the future of the nation. Some resources are
available but, being governmental, they share the same origin and do not offer the
independence of an evaluation that originates in society itself.
Precisely in this "information vacuum", the Ranking Universitário da Folha (Folha's
University Ranking, RUF) appears. Known for its traditional evaluation, the Folha
ranking is considered an independent evaluation tool and provides a ranking of the
best Brazilian universities. The RUF is developed under the responsibility of Folha de
São Paulo (founded in 1921) and uses several mechanisms to rank the 195 best
universities in the country, public or private. Its execution is in the charge of Datafolha.
According to Folha de São Paulo (2016), on its own website, the RUF
evaluates the 195 Brazilian universities based on 5 indicators: Scientific Research; Quality
of Teaching; Internationalization; Labour Market; Innovation.
Data are obtained from a variety of sources, including two annual surveys,
encompassing thousands of respondents, and data are collected from such sources as:
a. Inep-MEC
b. Web of Science
c. SciELO
d. Inpi
e. FAPs
f. CNPq
g. Capes
h. Two Datafolha surveys done annually

The question that motivated this research is:
With what degree of certainty can we predict whether or not a university will be in the
RUF by analysing only the data provided by CAPES on the graduate programs of
universities?
To answer it, we use Knowledge Engineering to estimate this ranking, applying tools and
techniques of data mining, classification, machine learning, recommendation,
prediction, and probabilistic algorithms.
2 Theoretical Construction
Research in universities is usually associated with research groups, most of them led by
Ph.D.s. It is therefore plausible to hypothesize that the structure and
functioning of postgraduate programs strongly influence the RUF, all the more so because research and
teaching quality are relevant parts of the RUF, as the analysis of its construction
and structure shows. In this section, we look at how the RUF is built and at the data
mining tools that support the experiment.
We first briefly describe what the RUF is and how it is composed.
2.1 Structure of the RUF and the open data of CAPES
An analysis of how the RUF is composed shows that 74% of the data
relate directly to Scientific Research (42%) and Quality of Teaching (32%), while the other
topics, Labour Market, Internationalization, and Innovation, together represent the
remaining 26%. Given this predominance of research-related data, and as
research is generally attributed to postgraduate programs, we came to the view
that, although the ranking is aimed at undergraduate education, the data of the Coordenação de
Aperfeiçoamento de Pessoal de Nível Superior (Coordination for the Improvement of Higher
Education Personnel, CAPES) could carry important weight in the ranking.
According to Gonçalves (2006), several basic statistical approaches have been proposed for machine
prediction and learning that use algorithms to establish patterns; k-means clustering
and Bayesian methods are examples.
Bayes was an 18th-century English philosopher whose theory of
probability was published in 1763. The rule that bears his name has been a cornerstone of probability
theory ever since. The difficulty with applying Bayes' rule in practice is the attribution of
prior probabilities (Witten and Frank, 2011).
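For reference, the rule mentioned above can be written as follows. The notation is generic and not taken from the paper: $c$ stands for a class (for example, in or out of the ranking) and $x_1,\dots,x_n$ for the observed attributes. The second line is the conditional-independence assumption that gives the Naive Bayes classifier its name.

```latex
\begin{align}
P(c \mid x_1,\dots,x_n) &= \frac{P(c)\,P(x_1,\dots,x_n \mid c)}{P(x_1,\dots,x_n)} \\
P(x_1,\dots,x_n \mid c) &\approx \prod_{i=1}^{n} P(x_i \mid c)
\end{align}
```

Here $P(c)$ is the prior probability whose attribution is the practical difficulty noted above.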
In this research, the Naive Bayes algorithm was used within a supervised learning
approach based on probabilistic methods (Fulmari and Chandak, 2014). The
CAPES open data of 2014 and the RUF 2015 were used for training.
On this basis, the RUF 2016 was predicted using the CAPES open data of 2015. The
prediction made by the algorithm for 2016 was then compared with the results
published by the RUF. We also tried to build a decision
tree with the J48 algorithm, but the high number of branches made the tree unworkable, and after
"pruning" it became insignificant. As a tool, we used WEKA.
2.2 Discovery of Knowledge
Knowledge discovery is a process applied to structured, semi-structured, and
unstructured data for the purpose of verifying users' hypotheses or discovering
new patterns. It can be further subdivided into two objectives: the prediction of
future behavior based on the analysis of historical data, and the presentation of patterns
identified in the data analysis (Fayyad, Piatetsky-Shapiro and Smyth, 1996).
Knowledge Discovery in Databases (KDD) is the area that provides mechanisms and
techniques for structured data analysis. KDD can be seen as a multidisciplinary activity,
as it encompasses techniques from beyond the scope of any single discipline, such as machine
learning (Fayyad, Piatetsky-Shapiro and Smyth, 1996). As part of KDD, data mining acts
on extracting useful information from databases.
2.3 Data mining
Nowadays, the volume of digital data stored in electronic repositories grows at
a fast pace, driving a major migration of software companies towards big data
technologies and open data, according to studies published by IDC in 2015: Worldwide
Technology Big Data and Forecast Services, 2015-2019 (IDC # 259532) and the
Worldwide Big Data Forecast by Vertical Market, 2014-2019 (IDC # US40544915).
The term data mining has been used by statisticians, data analysts, and the
management information systems community, and is most popularly associated with
databases (Fayyad, Piatetsky-Shapiro and Smyth, 1996). The data mining process is
supported by techniques that train and test on historical data, thus
recognizing patterns. This method is characteristic of machine learning techniques for
pattern recognition, such as classification and clustering, among others
(Fayyad, Piatetsky-Shapiro and Smyth, 1996).
Each technique has a variety of algorithms available, with their
variations. This article addresses classification by means
of the Naïve Bayes algorithm.
2.4 Classification
Classification is a process that we have carried out constantly throughout history and
in our daily lives. Classifying means of transportation as air, land, or sea, people as adults
or minors, and the population into economic classes are some examples.

The classification process consists of examining the characteristics of an object
to be classified and assigning it to one or more classes (Linoff and Berry, 2011). When the
age of a person is given, for example, applying the current age-of-majority rule makes it
possible to classify the individual as an adult or a minor.
In data mining, the objects to be classified are usually represented by records in a
database table or a file, to which a column representing the class is added. The task of
classification is characterized by a definition of distinct classes that are identified from a
training set composed of pre-classified examples (Linoff and Berry, 2011).
Classification alone is not enough for automated decision making in complex cases,
but it is an excellent guideline for decision-making in knowledge-intensive activities.
Thus, in seeking to identify the risk of a client failing to fulfil its obligations, the technique
seeks to predict the future. For example, based on past experience, a
financial institution can establish which clients present an acceptable risk for receiving loans.
"Any of the techniques used for classification and estimation can be adapted for
use in prediction by using training examples where the value of the variable to be
predicted is already known, along with historical data for those examples. The historical
data is used to build a model that explains the current observed behavior. When this
model is applied to current inputs, the result is a prediction of future behavior" (Linoff
and Berry, 2011).
It is in this sense that classification is applied to the case under study.
There are numerous classification algorithms. ID3 and C4.5 are
examples of classification algorithms that use a symbolic approach [1]. Other
algorithms, such as Naïve Bayes and k-Nearest Neighbors, take a statistical approach [2] and have several
implementations. The WEKA software [3] is an example of software that implements
several algorithms related to data mining and knowledge extraction.
For the case study, which involves case-based prediction, the Naïve Bayes algorithm was
used. "Naïve Bayes is a popular technique for this application because it is very fast and
quite accurate" (Witten, Frank and Hall, 2011). The Naïve Bayes algorithm is quite
effective when applied to data sets in combination with selection procedures that
eliminate redundancies.
Naive Bayes-based algorithms, which calculate explicit probabilities for hypotheses, are
among the most practical approaches to certain types of learning problems. Research has
shown that the Naive Bayes classifier can outperform decision tree-based
algorithms and even neural networks (Mitchell, 1997).
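To make the idea concrete, the following is a minimal sketch of a Naive Bayes classifier for numeric attributes, assuming a Gaussian likelihood per attribute. It is not WEKA's implementation and not the authors' code; the function names and the toy data are invented for illustration only.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-attribute mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-6))  # small floor avoids division by zero
        model[label] = (prior, stats)
    return model

def predict(model, row):
    """Return the class with the highest log-posterior (up to a shared constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy data: (master's courses, doctoral courses, teachers with a doctorate).
train_rows = [(0, 0, 5), (1, 0, 8), (2, 1, 12), (10, 6, 150), (12, 8, 200), (9, 5, 120)]
train_labels = ["N", "N", "N", "S", "S", "S"]
model = fit_gaussian_nb(train_rows, train_labels)
```

Calling `predict(model, row)` on a new tuple returns whichever label has the larger posterior; working in log space avoids numeric underflow when many attributes are multiplied.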
[1] Classifies based on decision trees, with rules such as "if the day is sunny, then it will not rain".
[2] Verifies the probability of an event occurring.
[3] Available at http://www.cs.waikato.ac.nz/ml/weka/

3 Methodology
The methodological classification of this research characterizes it as applied, since it
produces immediate results; it is also basic, in that it can serve as the basis for other
research (Marconi and Lakatos, 2010). As for its objectives, it is descriptive, insofar as it
describes characteristics of a phenomenon and establishes relations between variables. In
seeking to establish limits and approaches for new research, delimiting an unknown area,
it also has an exploratory objective. It further presents an explanatory
objective, since it "deepens the knowledge of reality because it explains the reason, the
why of things" (Gil, 2002). It takes a qualitative approach, as the researchers attribute
meanings to the data; on the other hand, it is quantitative because it follows statistical
rigor, using not merely samples but the whole universe of universities. We
used bibliographic, documentary, and experimental procedures (Gil, 2008).
The research followed these steps:
1. The CAPES open data for the years 2014 and 2015 on
students, teachers, and courses were obtained, processed, and prepared, composing
a relational database. Next, the RUF data of 2015 and 2016 were obtained,
treated, and loaded into relational database tables.
2. The data were then prepared for mining by building a correlation and conversion table
matching the university initials of both systems (CAPES and RUF). This was an
exhaustive task, and 2 incompatibilities remained unsolved; they
are part of the general analysis of the data.
3. The CAPES data were then summarized per university: Masters and Ph.D. courses,
number of teachers, and graduates. The predictors were each of
these summarized fields, and the decision (class) was whether or not the university belonged to
the RUF, thus making the RUF and CAPES data compatible.
4. Following the concepts of machine learning, we used the data from 2014 as the training set,
training the machine. To do so, we used the WEKA software [4], in which we applied the
Naïve Bayes algorithm.
5. Next, we submitted the 2015 data in order to predict which
universities would be in the RUF 2016, and compared the prediction with the actual result.
6. The results were then compared and a confusion matrix was established, indicating
both false positives and false negatives. The results are interesting because a significant
result was obtained with open data only, without requiring surveys, interviews, or other
data that are not open (such as numbers of publications and citations, among others).
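The summarizing and labelling of steps 2 and 3 can be sketched as follows. This is an illustrative outline only: the paper did this work in a relational database with an ETL tool, and the record layout, the field names (`uf`, `sigla`, `level`) and the sample rows here are all invented.

```python
from collections import Counter

# Hypothetical, invented records: one row per postgraduate course in CAPES.
capes_courses = [
    {"uf": "SC", "sigla": "UFSC", "level": "mestrado"},
    {"uf": "SC", "sigla": "UFSC", "level": "doutorado"},
    {"uf": "SC", "sigla": "XYZ", "level": "mestrado"},
]
# Hypothetical set of (state, acronym) pairs that appear in the ranking.
ruf_members = {("SC", "UFSC")}

def summarize(courses):
    """Count master's and doctoral courses per (state, acronym) key."""
    counts = {}
    for course in courses:
        key = (course["uf"], course["sigla"])
        counts.setdefault(key, Counter())[course["level"]] += 1
    return counts

def build_instances(courses, members):
    """One instance per university: predictors plus the class S (in RUF) or N."""
    instances = []
    for key, c in sorted(summarize(courses).items()):
        label = "S" if key in members else "N"
        instances.append((key, c["mestrado"], c["doutorado"], label))
    return instances

instances = build_instances(capes_courses, ruf_members)
```

Keying on the pair (state, acronym) rather than the acronym alone is a design assumption made here to reduce accidental matches between homonymous institutions.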
[4] WEKA is open source software produced at the University of Waikato (NZ); it is a collection of machine
learning algorithms for data mining tasks.

4 The experiment: Open data CAPES and RUF
The first step was to collect the raw data from both the RUF and CAPES.
From the RUF website, the ranking data of 2015 and 2016 were collected and
a CSV file was generated; that file contains all the RUF data, with the reference
year added.
Figure 1 - RUF as it is presented in Folha's website
Next, the data were standardized, i.e., prepared so that they could be handled
by the software tools.
The CAPES open data are provided in CSV format, which is more suitable for
processing than the RUF, which is a web page. Because the data are in CSV,
processing them is simpler than for the RUF.
From the CAPES open data site, the following files were downloaded for
postgraduate programs for the years 2014 and 2015: courses, teachers, and
students.
The raw data were imported into a PostgreSQL database to facilitate
data normalization and extraction in the format expected by the WEKA software.
Being in standard CSV does not mean that the data are sanitized [5]. For
sanitizing and importing the data into the database, an Extract, Transform and Load (ETL)
tool was used, as is usual in the KDD process, more specifically in its data preprocessing phase.
The tool chosen was Pentaho's Data-integration [6] (Figure 2).
[5] A process that standardizes the data while maintaining their validity.
[6] Can be downloaded from the Pentaho Community at http://community.pentaho.com/projects/data-integration/

Source: Tools utilized
Figure 2 - Example ETL applied to RUF (left) and WEKA interface with CSV file import (right)
With the Data-integration tool, all data of interest to the research were imported. Even
though both data sources deal with the same domain (universities), the
institution initials diverge between the databases. Even after sanitization,
approximately 50 universities that are part of the RUF were not located in the CAPES data.
To reduce this difference, a manual analysis of the data was required.
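One way to narrow down such mismatches before manual review is to normalize the initials on both sides and list what still fails to match. The sketch below is an assumption about how this could be done, not the procedure used in the paper (which relied on Pentaho and manual analysis); the function names and the sample initials are illustrative.

```python
import unicodedata

def normalize_sigla(text):
    """Uppercase, strip accents, and drop anything that is not a letter or digit."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in without_accents.upper() if ch.isalnum())

def unmatched(ruf_siglas, capes_siglas):
    """Return the RUF initials with no counterpart in CAPES after normalization."""
    known = {normalize_sigla(s) for s in capes_siglas}
    return [s for s in ruf_siglas if normalize_sigla(s) not in known]

# Invented examples of formatting differences between the two sources.
ruf = ["UFSC", "PUC-Rio", "Unesp", "UNIFAKE"]
capes = ["ufsc", "PUC/RIO", "UNESP"]
left_for_manual_review = unmatched(ruf, capes)
```

Normalization only resolves differences in case, accents, and punctuation; institutions that use genuinely different initials in each source would still need the manual analysis described above.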
What was provided for machine learning in the case under study is precisely
the union of the CAPES data of 2014 and the RUF 2015, called the "training set".
The new CAPES data are later passed to the already "trained" algorithm, which classifies them
and makes predictions based on the knowledge acquired in training. The data used
were from CAPES 2014, with state (UF) and university initials made compatible with the RUF.
The training file was then imported into WEKA through its graphical interface, to apply
the Naïve Bayes algorithm and analyze its precision (Figure 2).
Several analyses and tests were performed to identify the configuration with the best
possible result. This is an extremely important activity in
the process of knowledge extraction, and it is linked to the required interdisciplinarity,
in which a subject expert contributes to these adjustments.
Applying the algorithm to the training data resulted in a 78.95% success rate
(315 of 399 instances classified correctly). The confusion matrix generated is as follows,
where N means not in the RUF and S means in the RUF:
   a    b   <- classified as
 191   33 |  a = N
  51  124 |  b = S
With this level of precision, the training data were exported via WEKA to the ARFF
format.
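The success rate can be checked directly from the four counts in the confusion matrix. The short sketch below assumes the usual WEKA layout, in which rows are the actual class and columns the predicted class; precision and recall for class S are added here as derived figures and are not reported in the paper.

```python
# Counts copied from the confusion matrix (rows: actual class, columns: predicted class).
tn, fp = 191, 33   # actual N: predicted N, predicted S
fn, tp = 51, 124   # actual S: predicted N, predicted S

total = tn + fp + fn + tp
accuracy = (tp + tn) / total      # share of all instances classified correctly
precision_s = tp / (tp + fp)      # of those predicted S, how many really are S
recall_s = tp / (tp + fn)         # of those really S, how many were predicted S

print(f"instances={total} accuracy={accuracy:.4f} "
      f"precision(S)={precision_s:.4f} recall(S)={recall_s:.4f}")
```

Note that this figure is measured on the same data used for training, so it is likely to be more optimistic than performance on unseen data.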

After training, the next step is to predict the RUF for 2016. For this, an ARFF file
was generated with the 2015 CAPES data, to be classified by the algorithm trained by the
machine. This ARFF file contains the same columns as the training file; however,
the decision (class) column contains "?", indicating that the algorithm must predict it.
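A file of this kind can be sketched as follows. The six attribute names are the ones shown in the WEKA output reproduced in Figure 3; the relation name, the class attribute name `ruf`, and the helper `to_arff` are assumptions made for illustration. The range `-p 3-8` in the command suggests that the original file had two further leading attributes, which are omitted here.

```python
ATTRIBUTES = ["cursos_mestrado", "cursos_doutorado", "formados_mestrado",
              "formados_doutorado", "docentes_mestres", "docentes_doutores"]

def to_arff(relation, rows, class_values=("N", "S")):
    """Render rows of numeric predictors as ARFF text with an unknown class ('?')."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in ATTRIBUTES]
    # Nominal class attribute; its name is hypothetical.
    lines.append("@attribute ruf {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for row in rows:
        # '?' is the ARFF marker for a missing value, here the class to be predicted.
        lines.append(",".join(str(v) for v in row) + ",?")
    return "\n".join(lines) + "\n"

arff_text = to_arff("capes_2015", [(0, 2, 0, 88, 22, 2), (1, 0, 0, 0, 40, 0)])
print(arff_text)
```

The header must declare the same attributes, in the same order, as the training file, otherwise WEKA will reject the test set as incompatible.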
Using a terminal [7] in the OS X operating system, the following command was
executed to instruct WEKA to perform the classification based on the training data:
java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t training.arff -T sort.arff -p 3-8 -D
As a result, WEKA presents the classification performed for each instance of the file to
be classified. Figure 3 shows the partial result of WEKA processing.
=== Predictions on test data ===
Values in parentheses: (cursos_mestrado, cursos_doutorado, formados_mestrado,
formados_doutorado, docentes_mestres, docentes_doutores)
inst#  actual  predicted  error  prediction  (attribute values)
01     1:?     1:N               0.996       (0,2,0,88,22,2)
02     1:?     1:N               0.994       (0,2,0,169,28,0)
03     1:?     1:N               0.988       (0,3,0,75,95,0)
…      …       …                 …           …
23     1:?     1:N               0.998       (1,0,0,0,40,0)
Source: WEKA output
Figure 3 - Data obtained from Weka
The predicted column gives the prediction for each entry (whose data appear in parentheses),
indicating whether an institution with those characteristics of courses, teachers, and
students tends to be part of the RUF ranking. The result of the prediction was normalized
and imported into the PostgreSQL database. After that, the RUF 2016 was compared with the
result of the predictions, with the following outcome: of the 195 institutions that make
up the RUF 2016, 120 were predicted by the Naïve Bayes process, a success rate of 61.5%.
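The reported rate is the share of actual RUF 2016 institutions that the classifier also placed in the ranking. A minimal sketch of that comparison follows; the identifiers are placeholders standing in for the real institutions, and only the counts 195 and 120 come from the paper.

```python
def hit_rate(predicted, actual):
    """Count and share of institutions in the actual ranking that were also predicted."""
    hits = len(set(predicted) & set(actual))
    return hits, hits / len(actual)

# Placeholder identifiers for the 195 RUF 2016 institutions,
# of which 120 were predicted by the classifier.
actual_ruf = [f"univ_{i}" for i in range(195)]
predicted_in_ruf = actual_ruf[:120]

hits, rate = hit_rate(predicted_in_ruf, actual_ruf)
print(f"{hits} of {len(actual_ruf)} = {rate:.1%}")
```

Because the denominator is the set of ranked institutions, this measure corresponds to recall for the class "in the RUF"; it says nothing about institutions predicted to be in the ranking that were not.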
[7] Also known as a command line or shell; it allows the user to ask the operating system to perform actions such
as listing files, creating directories, and running an application, among others.

5 Conclusions
Sobrinho and Ristoff (2005) emphasize the importance of objective criteria for institutional
evaluation, asserting that objective criteria and procedures that prioritize quantitative and comparable
aspects are required.
The publication of linked open data expands this comparison process, allowing human
and non-human agents to process and analyze information. Berners-Lee (1989) suggests
that data should be open, especially data that can, in the future, be classified with 5 stars.
Classification  Description
★               Available on the web (whatever format) but with an open licence, to be Open Data
★★              Available as machine-readable structured data (e.g. excel instead of image scan of a table)
★★★             As (2) plus non-proprietary format (e.g. CSV instead of excel)
★★★★            All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
★★★★★           All the above, plus: link your data to other people's data to provide context
Source: Berners-Lee (1989, 2006)
The act of measuring, although part of society's process of evaluating
universities, cannot be considered in isolation (Vianna, 2014):
The evaluation will express the actions, attitudes, and values of both individuals and
communities or the science itself; if possible it should contemplate its multiple
dimensions and interrelationships. It will always produce effects over time, be they
political or pedagogical. An important part of the evaluation refers to the tests
applied, the questionnaires to be answered and the results obtained - this is what is
called the technical part of the evaluation; therefore, measurement is part of the
evaluation, but the evaluation is not exhausted in the measurement. This means that it
is not enough to assign grades, weights, and concepts.
A hit rate above 60% on the RUF ranking shows that it is possible, with a more
detailed study and analysis of the techniques, to predict it with a certain degree of
confidence. It should be noted that, according to the RUF, Scientific Research (mostly
postgraduate) corresponds to a 42% weight in the ranking.
Another hypothesis is to make a cut, selecting only the first 60 universities. An
algorithm to predict the 40, 50, or 60 best Brazilian universities, based strictly on open
CAPES data, may present a higher degree of confidence.
It is also observed that the CAPES processes have a positive effect (above 60%)
on the quality management of the postgraduate programs of universities, which is
intrinsically linked to the quality of higher education.

References
Berners-Lee, T. (1989) Information management: A proposal.
Witten, I. H. and Frank, E. (2011) Data Mining: Practical machine learning tools and techniques.
Morgan Kaufmann
Sobrinho, J. D. and Vessuro, H. (2008) Quality, evaluation: from SINAES to indexes. In: Avaliação da
Educação Superior magazine.
Vianna, C. T. (2014) Avaliação institucional e o desafio da cultura da autoavaliação e CPA. In:
conference publications of the regional seminar on institutional self-evaluation and
evaluation committees
Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996) From data mining to knowledge discovery
in databases. AI Magazine, Vol. 17, No. 3, p. 37
Fulmari, A. and Chandak, M. B. (2014) An approach for word sense disambiguation using modified
naïve bayes classifier. International Journal of Innovative Research in Computer and
Communication Engineering, Vol. 2
Gil, A. C. (2002) Como elaborar projetos de pesquisa. São Paulo, Vol. 5
Gil, A. C. (2008) Métodos e técnicas de pesquisa social. In: Métodos e técnicas de pesquisa social.
Atlas
Gonçalves, A. L. (2006) Um modelo de descoberta de conhecimento baseado na correlação de
elementos textuais e expansão vetorial aplicado à engenharia e gestão do conhecimento. 196 f.
Tese (Doutorado) — Tese (Doutorado em Engenharia de Produção)-Programa de Pós-
Graduação em Engenharia de Produção, Universidade Federal de Santa Catarina, Florianópolis
Linoff, G. S. and Berry M. J. (2011) Data mining techniques: for marketing, sales, and customer
relationship management. John Wiley & Sons
Marconi, M. d. A. and Lakatos, E. M. (2010) Fundamentos de metodologia científica. In:
Fundamentos de metodologia científica. ed Atlas
Mitchell, T. M. (1997) Machine learning. New York
Rish, I. (2001) An empirical study of the naive Bayes classifier. In: IBM New York. IJCAI 2001
workshop on empirical methods in artificial intelligence. Vol. 3, No. 22, pp. 41–46.
Sobrinho, J. D. (2006) Acreditación de la educación superior en américa latina y el caribe. In:
TRES, J.; SANYK, B. C. (Ed.). La educación superior en el Mundo 2007. Acreditación para la
garantía de la calidad: ¿Qué está en juego? Global University Network for Innovation
Sobrinho, J. D. and Ristoff, D. I. (2005) Avaliação como instrumento da formação cidadã e do
desenvolvimento da sociedade democrática: por uma ético-epistemologia da avaliação. Ristoff,
Dilvo & Almeida JR, Vicente (organizadores). Avaliação Participativa, Perspectivas e
Debates, série Educação Superior em Debate, No. 1, pp. 15–38
Sobrinho, J. D. and Vessuri, H. (2006) Paradigmas e políticas de avaliação da educação superior.
autonomia e heteronomia. Universidad e investigación científica: convergências y tensiones.
Vessuri H, org. Buenos Aires: CLACSO, Consejo Latinoamericano de Ciencias Sociales, pp.
169–191