1. Experience in big data analytics:
1. Data analysis for students’ performance and behaviors
To collect data on students’ performance and behavior, I designed and developed a
service-oriented system in PHP and Flex. Teaching staff can manage the question bank and start
tests for students. Participating students access the system through Web browsers and mobile
devices (such as iOS and Android devices). The system can be used both during and after classes
to test students’ performance. I was involved in the design and development of the entire
system.
Currently, the system uses MySQL for data storage. After each test, or after a set period of
time, the new student scores and behavior data (such as the answering time and modification time
for each question) are extracted from MySQL and saved to CSV files. I use MATLAB and R for data
visualization (histograms, density plots, etc.) to analyze the relationship between these
behavioral features and the students’ scores. Additionally, I use Python to implement similarity
measures (such as Euclidean distance and the Pearson correlation coefficient) to compute the
similarity between students. K-Means clustering is also employed to group the students. The
results show that students can be divided into several clusters with similar scores and
behaviors. Since the questions in the system are organized into knowledge categories, each
containing more than one question, teachers can devise teaching strategies to guide students
with similar performance. To check whether students have since learnt the material behind
questions they previously answered incorrectly, the system can automatically recommend similar
questions that were answered by similar students in the same cluster. I have implemented this
recommendation service in Python.
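The similarity measures and clustering step described above can be sketched as follows. This is
a minimal, standard-library-only illustration, not the code of the actual system: the feature
layout (score, mean answering time, number of modifications), the sample values and the function
names are all assumptions made for the example.

```python
import math
import random


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def k_means(points, k, max_iterations=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical students: [score, mean answering time (s), modifications].
points = [
    [90.0, 20.0, 1.0],
    [85.0, 25.0, 2.0],
    [40.0, 70.0, 6.0],
    [45.0, 65.0, 5.0],
]
centroids, labels = k_means(points, k=2)
similarity = pearson_correlation(points[0], points[1])
```

In practice the features would probably need to be scaled to comparable ranges before clustering,
since an unscaled score or time column can dominate the Euclidean distance.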
2. Data analysis for home automation
A home automation prototype is implemented with 51-series (8051-compatible) single-chip
microcomputers. Temperature, humidity, ultrasonic and light sensors connected to the
microcomputers automatically collect environmental data, which is sent in real time over HTTP to
a service implemented in PHP. The service writes the data via Thrift to HBase, which is deployed
on OpenStack. I am mainly responsible for implementing the service and the data storage, and for
setting up Hadoop, HBase and OpenStack.
I have implemented a naïve Bayes classifier in Python. I divide the data in HBase into two
subsets: one for training the classifier and the other for testing it. I have defined several
classes, each of which invokes a different service in response to the occupants. Once trained,
the classifier can predict which services individuals are likely to need.
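The train/test workflow described above might look roughly like the sketch below. It assumes a
Gaussian variant of naïve Bayes over numeric sensor readings, which the text does not specify;
the feature order (temperature, humidity, light), the class names "cooling" and "lighting" and
all sample values are hypothetical, and the data is held in memory rather than read from HBase.

```python
import math
import random
from collections import defaultdict


def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]


def fit(rows):
    """rows: list of (features, label). Returns {label: (prior, means, variances)}."""
    by_label = defaultdict(list)
    for features, label in rows:
        by_label[label].append(features)
    model = {}
    for label, samples in by_label.items():
        n = len(samples)
        columns = list(zip(*samples))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-6
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict(model, features):
    """Return the label with the highest log-posterior for the given features."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)


# Hypothetical readings: ([temperature C, humidity %, light level], service class).
readings = [
    ([30.0, 40.0, 800.0], "cooling"),
    ([31.5, 45.0, 750.0], "cooling"),
    ([29.0, 42.0, 820.0], "cooling"),
    ([32.0, 38.0, 780.0], "cooling"),
    ([21.0, 50.0, 50.0], "lighting"),
    ([20.0, 55.0, 30.0], "lighting"),
    ([22.0, 52.0, 60.0], "lighting"),
    ([19.5, 48.0, 40.0], "lighting"),
]
train, test = train_test_split(readings)
model = fit(train)
accuracy = sum(predict(model, f) == label for f, label in test) / len(test)
```

The predicted label would then be used to choose which service to invoke; how that dispatch is
wired up is outside the scope of this sketch.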
3. Data analysis for train derailment
Data sets on train derailments can be collected from some companies. However, they are
sometimes in HTML or XML format. Therefore, I use the Python packages Universal Feed Parser and
Beautiful Soup to extract the valuable data and save it to a CSV or TXT file. Additionally,
train derailments may be related to weather conditions, so I have downloaded open data sets
describing past UK climate from Websites such as the Met Office.
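As an illustration of the HTML-to-CSV step, the sketch below pulls the rows of an HTML table
into CSV text. To stay free of third-party dependencies it uses the standard library's
html.parser in place of Beautiful Soup, so it is not the code actually used; the table layout,
column names and placeholder values are invented for the example.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Placeholder input; real derailment reports will have a different structure.
SAMPLE_HTML = """
<table>
  <tr><th>date</th><th>location</th><th>cause</th></tr>
  <tr><td>2010-01-05</td><td>Site A</td><td>cause X</td></tr>
  <tr><td>2011-07-19</td><td>Site B</td><td>cause Y</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(SAMPLE_HTML)

# Write to an in-memory buffer; a real run would open a .csv file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
```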