Research Proposal
Gamer Behaviour Profiling in Video Games
by using Clustering Algorithms
MSc Research Project
Data Analytics
Yash Balaji Iyengar
Student ID: x18124739
School of Computing
National College of Ireland
Supervisor: Dr. Anu Sahni
National College of Ireland
Project Submission Sheet
School of Computing
Student Name: Yash Balaji Iyengar
Student ID: x18124739
Programme: MSc Data Analytics
Year: 2019-20
Module: Research in Computing
Supervisor: Dr. Anu Sahni
Submission Due Date: 10th August 2019
Project Title: Gamer Behaviour Profiling in Video Games by using Clustering
Algorithms
Word Count: 7340
Page Count: 20
I hereby certify that the information contained in this (my submission) is information
pertaining to research I conducted for this project. All information other than my own
contribution will be fully referenced and listed in the relevant bibliography section at the
rear of the project.
ALL internet material must be referenced in the bibliography section. Students are
required to use the Referencing Standard specified in the report template. To use other
author’s written or electronic work is illegal (plagiarism) and may result in disciplinary
action.
Signature:
Date: 10th August 2019
Gamer Behaviour Profiling in Video Games by using
Clustering Algorithms
Yash Balaji Iyengar
x18124739
10th August 2019
Abstract
Over the past couple of decades, there has been a surge in the number of people
playing video games. As a result, gaming has become a promising industry with
lucrative opportunities for new investment and careers, and many entrepreneurs
are looking to build businesses in it. Demand for video games keeps growing, and
developers have filled the market with games of different genres on multiple
platforms. Of all these games, very few are liked by the majority of gamers.
To understand the reasons behind the success of a video game, game developers
study data about its users. This data, called game telemetry data, records users’
gameplay activity and is analysed with machine learning models to understand
gamers’ behaviour patterns. This research focuses on analysing the behaviour and
understanding the thought process of a special group of gamers called Hard Core
Gamers. By studying their behaviour patterns, game developers can devise
strategies to market their game to a larger audience and improve user experience
by making changes to the game. This paper proposes a methodology that uses
multiple clustering algorithms, namely DB-Scan, k-Medoids, Gaussian Mixture
Models and K-means, to segment Hard Core Gamer telemetry data and uncover their
playing patterns.
Keywords: Game Behaviour Profiling, Game Mining, Game Analytics, DB-Scan, K-means
1 Introduction
In recent times, owing to advances in computer science and the easy availability
of personal computers, the video gaming industry has been expanding steadily.
Approximately 2.2 billion people around the world play video games (Odierna and
Silveira, 2019). According to Neill et al. (2016), around 60% of American citizens
play computer games. The video game market’s revenue has crossed 100 billion dollars
according to Rejer and Twardochleb (2018). Video games can be classified by platform
into PC games, console games and mobile games. According to Choi et al. (2018),
more than half of the total revenue in the video game industry is generated by PC
and console games. On average, 1500 games of different genres are launched each year
(Li et al., 2019).
1.1 Motivation
The odds of launching a successful game are slim, as users have a wide variety of
games to choose from. A video game’s success depends on many factors, such as
game design, game architecture, user interaction and advertising the game to the right
audience. According to Ahmad et al. (2017), there are different types of game developers
in the market. Triple-A (AAA) developers have a large workforce and budget, and they
tend to make the more successful games. Independent developers, unlike AAA developers,
do not have strong financial backing and usually consist of a small workforce. Even so,
they also produce successful games: Minecraft, for example, is a very successful game
developed by an independent developer and later purchased by Microsoft. As the industry
is very competitive, a game sometimes fails to do well. Assassin’s Creed Unity, for
example, was launched in 2014 by Ubisoft, a AAA game developer, on both computer and
console platforms. The game received bad reviews and users did not like it, so the
company incurred losses.
Game designers have to answer many questions in the process of designing a game,
such as “How do our customers behave while playing a game?”, “What do users think
before buying a game?”, “How should we design our game so that it attracts more users?”
and “Why do users get bored of a game after playing for some time?”. The answers to
these questions can be found by analysing game telemetry data (Bauckhage et al., 2012).
A video game is played by thousands of people around the world simultaneously, and
this activity generates game telemetry data. These log files are a boon to video game
developers, as they are an unbiased source of user activity data. The data is evaluated
to understand gamer behaviour and tendencies, which helps game designers understand
the user’s experience and motivation to play the game. This field of study is called
Game Analytics (Bauckhage et al., 2012; Fernandes et al., 2019). Game mining is used
to understand a player’s in-game behaviour and the manner in which they engage with
the game. It analyses behavioural aspects such as the number of hours played, the types
and genres of games played and the playing pattern. The analysis helps game developers
understand their customers better and develop the game further, by making design
changes, by improving game mechanics through bug fixes and updates that enhance user
experience, or by recommending games to users based on their gaming behaviour. As a
result, user engagement increases, and with it video game sales and profits.
1.2 Research Question
• Are clustering techniques such as DB-Scan, Fuzzy C-means and Gaussian
Mixture Models more effective than state-of-the-art techniques like K-means, EFA
and PCA at using factors such as playing time, types of games played and number of
hours played to classify gaming behaviour, with the aim of increasing game popularity?
1.3 Research objectives
• To understand the behavioural tendencies of gamers that would help video game
developers to understand their customers.
• To help game designers identify bugs and faults in a game and fix them to enhance
user experience.
1.4 Paper Plan
The research proposal is divided into six sections. Section 2 discusses the studies that
have previously been carried out in this domain. Section 3 describes the approach chosen
and discusses the modelling techniques to be used in this research. Section 4 gives a brief
description of the evaluation metrics to be used to validate the results. Section 5 gives
the plan of execution for this thesis, which will take place in semester three, and
Section 6 summarises the entire proposal.
2 Literature Review
The main focus of this section is to summarise previously conducted research
in the field. Since the research will be conducted in the field of Video Game Analytics,
the techniques and machine learning algorithms used to obtain different results are
discussed here.
The related work is divided into four subsections: 2.1, 2.2, 2.3 and 2.4.
2.1 Behaviour Profiling Through Game Data Mining
Video games are played all over the world by a variety of age groups. Creating an
entertaining and successful video game takes time and feedback. To enhance the user
experience, a game development company tries to understand the needs and habits of
its players by leveraging the game telemetry data obtained from the current version
of its game. Data mining techniques uncover the different playing patterns in this
data, which gives rise to gamer behaviour profiling (Drachen et al., 2017). My main
motivation for conducting research in this field is the work of Baumann et al. (2018),
who studied a special category of players. These players, whom the authors call the
Hardcore Gamer Profile, invest more time than average video game players. Gameplay
behaviour was analysed using unsupervised learning: K-means clustering segmented the
player profiles into six clusters, each giving a unique insight into playing behaviour.
The number of clusters was decided using the silhouette and elbow methods, and the
clustering quality was assessed with the average silhouette score. Action games are
played the most and dominate gaming behaviour. Players in the first cluster tend to
play first-person shooter games. In cluster two, players tend to play free-to-play
games. Action game players are segmented under cluster three. Players in cluster
four play a Multiplayer Online Battle Arena game called Dota2, which is a free-to-play
game. Strategy game players are categorised under cluster five; these players play a
combination of strategy and action games. Players in cluster six are considered
genre-switching players, which suggests that they are open to trying new games.
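The cluster-count selection described above can be sketched as follows. This is a minimal illustration, not the procedure of Baumann et al. (2018): the toy data, the feature meaning and the deterministic seeding are assumptions made only so the sketch is small and reproducible.

```python
import math

def kmeans(points, k, iters=100):
    """Lloyd's k-means on a list of numeric tuples; returns one label per point."""
    # Deterministic seeding from evenly spaced rows (an assumption made only so the
    # sketch is reproducible; random or k-means++ seeding is the usual choice).
    centres = [points[i * len(points) // k] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centres.append(centres[c])  # keep the centre of an empty cluster
        if new_centres == centres:
            break
        centres = new_centres
    return labels

def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)  # singleton cluster or only one cluster: score 0
            continue
        a = sum(own) / len(own)                            # mean distance within own cluster
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical telemetry: (hours in action games, hours in strategy games).
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1), (11, 1),
          (0, 10), (1, 10), (0, 11), (1, 11)]
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On this toy data the three visually separate groups give the highest average silhouette, so three clusters are selected; real telemetry is rarely this clean, which is why the elbow method is often consulted alongside it.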
Another study, by Odierna and Silveira (2019), uses a similar methodology with
K-means clustering for player classification. The study was conducted on an
MMORPG named World of Warcraft. Players are classified according to the Bartle
taxonomy, which distinguishes four types of players: Killers, Achievers, Socializers
and Explorers (Bartle, 1996). The authors therefore established the ground truth from
the Bartle taxonomy and set the value of K to four when running the K-means algorithm,
so that the four clusters correspond to Socializers, Achievers, Explorers and Killers.
The largest group of players was identified and analysed. The majority of players in
a guild start at high levels. Players in the largest guild were found to play around
seven hours. Most players were categorised as Achievers or Explorers. Game developers
can leverage this information and develop the game in a manner that caters to players’
needs. For further research, a larger dataset should be considered.
Manero et al. (2016) classified gamers based on their gaming habits and preferences.
According to the authors, gamers can be classified by the game genres they play.
Their data was collected through a questionnaire covering game preferences and
habits. PCA was performed on the collected data using IBM SPSS; the first two
components together explained more than 50% of the variance in the data. The first
component comprises shooter, action and sports games. The second component consists
of social and musical games. The authors then performed K-means clustering with these
two components as input, similar to Zhang et al. (2016). The algorithm was run multiple
times and four clusters were obtained, classifying gamers into four groups: casual
gamers, non-gamers, hardcore gamers and well-rounded gamers. One of the main limitations
of this research is its data collection method. Survey data can be heavily flawed, as it
does not account for biases due to geographical location or for whether the respondents
actually play video games. The sample size of 754 is also small compared with other
studies.
The K-means clustering algorithm has wide applications and was used with a different
approach by Zhang et al. (2016) to classify professional basketball data. Coaches and
journalists classify NBA players, based on team lineup arrangement and minutes played,
into point guard, shooting guard, reserve guard and keyguard. The authors analysed
player data from the 2014-2015 NBA season and segmented the players into 6 clusters.
The data was pre-processed and consisted of the player name, team name, rebounds,
assists and score rate of each of 120 players in the NBA league. The uniqueness of the
study lies in how the value of k and the first cluster centre are selected. Unlike the
methods mentioned above, the authors used a mean squared error function to determine
the number of clusters. This type of evaluation metric was also used by Fu et al. (2017)
in their study on predicting customer retention. Of the results plotted for 2 to 8
clusters, the mean squared error decreases and the clustering results become stable
after six clusters. The initial cluster centres are determined using the remote first
algorithm, which works as follows. It first randomly selects a cluster centre from the
points in the dataset. It then calculates the Euclidean distance of each point from the
selected centre. A new centre is then chosen and the same steps are repeated until k
centres have been determined.
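The seeding steps described above resemble what is usually called farthest-first traversal. The sketch below assumes that reading, namely that each new centre is the point farthest from all centres chosen so far; whether this matches the procedure of Zhang et al. (2016) in every detail is not confirmed by the text. The player features are hypothetical, and the first centre is passed in as an index rather than drawn at random so that the result is reproducible.

```python
import math

def farthest_first_centres(points, k, first_index=0):
    """Pick k initial centres: start from one point, then repeatedly add the
    point whose distance to its nearest chosen centre is largest."""
    centres = [points[first_index]]
    while len(centres) < k:
        # Euclidean distance from every point to its nearest already-chosen centre.
        nearest = [min(math.dist(p, c) for c in centres) for p in points]
        centres.append(points[nearest.index(max(nearest))])
    return centres

# Hypothetical per-player features: (rebounds per game, assists per game).
players = [(1, 1), (2, 1), (9, 9), (10, 8), (1, 9)]
print(farthest_first_centres(players, k=3))
```

Seeding this way spreads the initial centres across the data, which tends to make the subsequent K-means run less sensitive to a poor random start.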
Steam is a game distribution platform owned by Valve Corporation, a game development
company B et al. (2018). Steam is widely used: millions of video game players use it
daily, and an extensive amount of data is available on it. Many features in the data
are used to categorise users’ behaviour. Neill et al. (2016) conducted a detailed study
of gamer behaviour on the Steam platform. They analysed different aspects of behaviour,
including time played, friend network and money spent on gaming. A correlation was
found between the number of achievements and the amount of time a game is played.
Heavy-tailed distributions are prevalent in a number of gaming behaviours. The number
of Steam accounts purchased and the number of friendships on the platform have increased
since 2008. Among the various groups on Steam, game server groups were observed to have
many members. The number of games owned by a single player is larger than the number of
games played. The study concludes that the majority of users play games casually and
that few players have high gameplay hours.
2.2 Game Analytics for Customer Retention
Player retention, or customer retention, is a crucial aspect for a game development
company as it is directly related to increases in revenue. Andrat and Ansari (2018)
proposed a methodology that identifies how long players play a particular game
and detects whether player numbers are falling. They use the k-means and DB-scan
clustering methods. The data is a survey of 12 questions on gaming habits, stored in a
database, and the tool used is RapidMiner Studio. The data is cleaned and both
clustering algorithms are applied to it. Clusters are formed for the different genres
of games played and the age groups of the players. According to the authors, the DB-scan
algorithm provided significantly better results than the k-means algorithm. In another
study, Vallim et al. (2013) used a different version of the DB-scan algorithm to find
differences in playing behaviour in an online video game. A branch of unsupervised
learning called Data Stream Mining was used to detect changes in behaviour. Multiple
algorithms were applied to data collected from a game called Unreal Tournament.
M-DBScan is a method based on incremental online clustering. eM-DBScan is used by
limiting novelty detection. The eM-DBScaNB algorithms showed better performance in
change detection than the other algorithms. Such modelling techniques are very helpful
for game designers.
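To make the density-based idea concrete, here is a minimal sketch of the basic DBSCAN procedure. It is the textbook batch algorithm, not the M-DBScan or eM-DBScan variants discussed above; the feature meaning, `eps` and `min_pts` values are assumptions chosen for illustration.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nbrs)
    return labels

# Hypothetical features: (sessions per week, average session length in hours).
sessions = [(0, 0), (0, 1), (1, 0), (1, 1),
            (8, 8), (8, 9), (9, 8), (9, 9),
            (4, 20)]
print(dbscan(sessions, eps=1.5, min_pts=3))
```

Unlike k-means, the number of clusters is not specified in advance, and isolated players are reported as noise instead of being forced into a group.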
As discussed above, customer retention is a very important aspect of online
video games. For this purpose, the game telemetry data obtained from players is
studied to create models that predict the customer churn rate. Unlike in other
industries, CRM software cannot use a general model to predict customer churn, as
different players stop playing for different reasons. Fu et al. (2017) proposed a
method that divides players into groups based on in-game behaviour, which helps in
understanding why a player stops playing the game. The Fuzzy C-means clustering
technique is applied to the game telemetry data, grouping it into 5 clusters. The data
was collected from a multiplayer online role-playing game called Dragon Nest and was
then cleaned of null values and noise. The dataset consists of 19 features, both
numerical and categorical. To deal with outliers, a normalization technique called
zero-mean is used. Unlike the clustering methods used by Rejer and Twardochleb (2018),
a novel metric called stickiness is introduced to find differences between players.
Stickiness is obtained from the Euclidean distance and represents different player
features related to interaction, engagement and performance. Since there are 19
features, principal component analysis is performed to reduce the dimensionality; the
first 7 components explain 86% of the variance in the data. When applying the Fuzzy
C-means algorithm, the smoothing parameter is set to 2 to obtain the best results. The
five clusters segment the players into 5 categories: leaders and aggressive gamers in
cluster one; players who leave, or churners, in cluster two; players who interact with
other players in cluster three; players who explore the game and try to learn new
things about it in cluster four; and players who show a high level of gameplay and
perform well in-game in cluster five. This clustering-based model has proven
effective in understanding customer retention.
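Two of the steps above lend themselves to a short sketch: zero-mean normalization and the membership step of Fuzzy C-means. The sketch assumes that "zero-mean" refers to the common z-score form (subtract the mean, divide by the standard deviation) and that the "smoothing parameter" is the usual fuzzifier m; neither is confirmed by the text, and the example values are made up.

```python
import math
import statistics

def zero_mean(column):
    """Z-score a feature column: zero mean, unit (population) standard deviation."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    if sigma == 0:
        return [0.0] * len(column)  # constant column carries no information
    return [(x - mu) / sigma for x in column]

def fcm_memberships(point, centres, m=2.0):
    """Fuzzy C-means membership of one point in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the Euclidean
    distance from the point to centre i. Memberships sum to 1.
    """
    dists = [math.dist(point, c) for c in centres]
    zeros = dists.count(0.0)
    if zeros:  # point sits exactly on a centre: full membership there
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

# Hypothetical values: a player one unit from centre A and two units from centre B.
print(fcm_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)], m=2.0))
print(zero_mean([2.0, 4.0, 6.0]))
```

With m = 2 the membership falls off with the squared distance ratio, so a player can belong partly to several behaviour groups instead of being assigned to exactly one.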
Kwon et al. (2018) proposed a model that predicts customer churn from
behavioural characteristics. The authors used a density-based clustering technique
and k-means clustering to extract features from an online mobile game. Seven player
behavioural features were extracted from the game daily over a period of one month.
The individual clusters are re-clustered using time series, and the clustered data is
used to predict customer churn. Accuracy, recall, F-score and precision are used to
evaluate model performance. Like Vallim et al. (2013) and Bernardi M.L. (2018), this
study obtained better results with the DB-scan clustering technique.
In a recent study, Li et al. (2019) proposed a method that analyses Steam user
profiles to extract features for customised game recommendations. They extracted
Steam user profile data using the Beautiful Soup Python library. In addition to the
extracted features, they created custom features for profile customization, a
workflow similar to that of Manero et al. (2016). Since there are multiple features,
Exploratory Factor Analysis was performed. Parallel Analysis was used to decide the
number of factors into which the features should be grouped: thousands of simulations
were performed, with uncorrelated variables simulated using the Monte Carlo technique,
and 8 factors were selected. A Steam user’s behaviour can be determined with the help
of these factors, and the analysis gives a precise idea of that behaviour. The authors
suggest that an empirical analysis should be performed to further validate a gamer’s
preference.
Archetypal Analysis is an unsupervised clustering algorithm used when a dataset is
distinctly categorised and contains one distinct sample for each category, known as
an archetype of the dataset. The method has been used in many domains in the past few
years to segment data. Sifa et al. (2013) used it to extract behavioural patterns from
the game Tomb Raider Underworld. The Simplex Volume Maximization algorithm and the
convex hull method were applied to a dataset containing information on sixty-two
thousand players of the game at the time of its release. The algorithm was applied to
the seven predefined levels of the game, which is why the authors considered Archetypal
Analysis well suited to this application. One of the clusters discovered in level two
identified users whose in-game deaths were higher due to environmental factors, which
may indicate that players had difficulty landing their jumps because of the game
mechanics. The main limitation of this research is that it was conducted on a single
game.
Drachen et al. (2012) attempted to overcome this shortcoming by applying k-means
and Simplex Volume Maximization to data from multiple games of different genres.
The authors first determine the number of features to work with using Principal
Component Analysis. Since the features have different data types and ranges, they
are normalized using zero-mean normalization. k-means is used to detect the behaviour
of the majority of players, as the algorithm assigns each data point to the closest
cluster based on distance (Arbelaitz et al., 2013). Owing to the nature of the method,
most of the information it yields concerns general players. The Simplex Volume
Maximization algorithm, an Archetypal Analysis method, is used to determine the
behaviour patterns of extreme players: it groups players who are extraordinarily good
at the game or very bad at it, because the algorithm focuses on extreme values at the
outer edge of the data. This information helps game developers understand the game
from a better perspective and make changes to prevent churn among players who are bad
at the game.
In a further study, Drachen et al. (2017) used a combination of clustering methods to
understand player behaviour profiles. The authors applied a Gaussian mixture model,
Archetypal Analysis, k-maxoid and k-means algorithms to an online shooting game
called Destiny. The data was converted to comma-separated value format by parsing the
JSON file. The data is split into two parts: one contains competitive play, where two
players play against each other, and the other contains play in a different setting,
with or against the computer. This gameplay data is analysed to obtain in-game
behaviour patterns. Features are selected manually based on the ground truth of the
data and fall into three types: how good the player is, what level of in-game mastery
the player has achieved, and the player’s playing pattern. After feature selection,
exploratory data analysis is performed and the correlation of all numeric features is
checked. Skewed data is then normalised using the zero-mean technique, which is also
used by Fu et al. (2017). The analysis is carried out in R using the Mclust package.
K-means and Archetypal Analysis are applied in the same way as in Drachen et al. (2012):
K-means is used to understand the general trends of players, whereas Archetypal Analysis
and k-maxoid are used to understand outlying behaviour, that is, the special-case
players. The Gaussian mixture model gives the user more flexibility than k-means: it
forms ellipsoidal clusters whose size can vary, whereas k-means assumes spherical
clusters of equal size. The residual sum of squares and the silhouette score are the
metrics used to evaluate the number of clusters.
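The contrast between the two models can be stated with their standard textbook objectives. The notation below is an assumption introduced for illustration (K clusters C_k with means \mu_k, mixture weights \pi_k and covariances \Sigma_k) and is not taken from Drachen et al. (2017).

```latex
% k-means: hard assignments, minimising the residual sum of squares
\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2}

% Gaussian mixture model: soft assignments, one covariance matrix per component
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right),
\qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0
```

Because each component carries its own \Sigma_k, the mixture model can fit elongated clusters of different sizes, while the RSS criterion measures only squared distance to the cluster mean.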
2.3 Game Bot Detection with Supervised and Unsupervised
Learning methods
So far we have only seen unsupervised algorithms used to segment data and understand
behaviour patterns. In a recent study, however, both supervised and unsupervised
machine learning techniques were used to distinguish computer bots from human players
based on behavioural characteristics in a dataset from a popular role-playing video
game. Game bot detection is essential because, unlike humans, who need to rest and go
to work, game bots are not restricted in their playing hours (Choi et al., 2018). Game
bots can play a video game non-stop and gain a higher level, which spoils the user
experience: when human players lose against bots with a superior in-game level, they
start losing interest in the game. Customer retention thus becomes a big problem for
the video game developer.
Bernardi et al. (2017) attempted to distinguish between human players and game
bots by studying game behaviour using time series. The experiment was carried out on
a dataset from a popular free online video game named Operation of Aion. The dataset
is first preprocessed and divided into training and testing sets, and class imbalance
is handled with different sampling methods. Feature extraction is performed to identify
the important predictors, similar to Li et al. (2019) and Drachen et al. (2017), using
a correlation technique. The extracted features are divided into five groups: two
groups give information about the player’s playing pattern within the game, and the
remaining three give information about player interaction and network connectivity.
Descriptive statistics computed in R on individual features of the different groups
showed, in box plots, how the statistics of human players and computer bots differ.
Classification is performed using time-series analysis. A neural network model
consisting of multiple perceptron layers, trained with backpropagation, extracts
information and classifies whether a game account belongs to a human player or a bot.
The dataset is partitioned using k-fold cross-validation with k = 10. The metrics
considered for model evaluation are precision, recall, F-measure and the area under
the receiver operating characteristic curve. According to the authors, the model was
able to determine whether a game account is a real human or a computer bot within a
second.
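The evaluation metrics named above can be computed directly from predicted and true labels. The following is a minimal sketch with made-up labels, where 1 stands for "bot" and 0 for "human"; it illustrates the definitions and is not the evaluation code of Bernardi et al. (2017).

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # flagged accounts that are bots
    recall = tp / (tp + fn) if tp + fn else 0.0     # bots that were flagged
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels for eight game accounts (1 = bot, 0 = human).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```

Precision matters when wrongly banning a human player is costly, while recall matters when missed bots damage the experience of other players; the F-measure balances the two.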
In subsequent research, Bernardi M.L. (2018) used six supervised algorithms
(HoeffdingTree, RandomForest, Reduced Error Pruning Trees, J48, which is a decision
tree used in the Java environment, Random Tree and Decision Stump) and multiple
unsupervised clustering algorithms (Density-Based Spatial Clustering of Applications
with Noise, k-means, the Farthest First algorithm, FilteredClusterer, Cobweb clustering
and Make Density clustering) to improve classification performance. The algorithms are
applied to five groups of data, namely Player Information, Player Action, Group
Activities, Social Interaction Diversity and Network Measures, each consisting of
multiple features. The Player Information (PI) and Player Action (PA) groups shed
light on gamers’ playing habits, which can be a good indicator for distinguishing
between a human player and a game bot. The author used recall, precision, F-measure,
ROC and the time required to run the algorithm as metrics for determining the best
result, similar to Kwon et al. (2018). The J48 algorithm among the supervised methods
and Make Density-Based Clusters among the unsupervised algorithms provided the best
results and classified the data very effectively. The author further improved the
results by using the BestFirst feature selection method. After applying this method,
the recall of the RandomForest algorithm improved from 0.953 to 0.960, while its
precision remained at 0.954. The best precision and recall obtained from the
clustering algorithms were 0.99 and 0.95 respectively.
2.4 Game Analytics in Free to Play Mobile Games
Mobile gaming has grown into a very large industry in recent years. Most of it runs on
the free-to-play model: the developer launches a game that users can download and play
for free, and the developer's main source of income is the in-game purchases that players
make in response to the incentives offered. Marketing strategies are therefore designed
around players' purchasing power, with the aim of improving the playing experience and
increasing revenue for the company. To create effective marketing strategies, the paying
players need to be identified and segmented into groups. This is proposed by Yang et al.
(2019), who use K-means clustering to differentiate players by in-game spending capacity,
deploy their model, and confirm through A/B testing that it performs better than the
existing benchmark model. The authors categorize the player data of the online mobile
game Thirty-six Stratagems into three types: low spenders, who spend little money on
game features, medium (moderate) spenders and high spenders. This is used as the ground
truth for applying the clustering algorithm. Feature engineering is performed on the
collected player data, as in Manero et al. (2016). Feature selection is based on an RFM
(recency, frequency, monetary) model, which captures how recently a player made a
purchase, how many times the player makes purchases and how much money the player
spends on in-game purchases. The metric used to decide the number of clusters is the
Silhouette score, as in Baumann et al. (2018). An A/B test is run for three days to
determine the better algorithm. The evaluation metrics are key performance indicators
(KPI), paying user rate (PUR) and average revenue per paying user (ARPPU). According to
the authors, the clustering model outperformed the base model.
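The RFM features described above can be illustrated with a minimal sketch. It assumes a purchase log of (player, date, amount) tuples; the function name `rfm_features`, the player IDs and the amounts are hypothetical and are not taken from Yang et al. (2019).

```python
from collections import defaultdict
from datetime import date


def rfm_features(purchases, today):
    """Return {player_id: (recency_days, frequency, monetary)}."""
    grouped = defaultdict(list)
    for player_id, day, amount in purchases:
        grouped[player_id].append((day, amount))

    features = {}
    for player_id, rows in grouped.items():
        last_purchase = max(day for day, _ in rows)
        recency = (today - last_purchase).days        # days since the last purchase
        frequency = len(rows)                         # number of purchases
        monetary = sum(amount for _, amount in rows)  # total amount spent
        features[player_id] = (recency, frequency, monetary)
    return features


# Hypothetical purchase log: (player id, purchase date, amount spent).
purchases = [
    ("p1", date(2019, 7, 1), 4.99),
    ("p1", date(2019, 7, 20), 9.99),
    ("p2", date(2019, 6, 15), 0.99),
]
rfm = rfm_features(purchases, today=date(2019, 8, 1))
```

Each player is reduced to three numbers, which can then be scaled and passed to a clustering algorithm such as K-means.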
Time series analysis has been used to observe player behavioural trends in mobile games
and to design marketing strategies that target a wider audience and retain the customer
base. Saas and Guitart (2017) apply a clustering algorithm to time series data to gain a
deeper understanding of user behaviour. The data is collected from a game called Age of
Ishtaria, and common features that can be found in any game, such as time spent in the
game, purchase history, game level and total log-ins, are examined. The authors use trend
extraction to derive features from the time series, as it uncovers hidden player
behaviour patterns. They use Ward's method, a form of agglomerative clustering that
merges the clusters giving the smallest increase in within-cluster variance. The optimal
number of clusters is assessed using Silhouette width, as in Drachen et al. (2017),
together with different visualization methods. Eight behaviour clusters are obtained.
These clusters are visualised using heatmaps, while the in-game purchase cluster is
visualised using a box plot because the data is sparse: not all users spend money on
in-game items. This data is used to observe churn behaviour as well as the in-game
behaviour of the players.

Author & Year                 Algorithm Used                                Evaluation Metric                   No. of data clusters
Odierna and Silveira (2019)   K-means                                       Bartle Taxonomy as Ground Truth     4
Li et al. (2019)              EFA                                           PA & Monte Carlo                    8
Yang et al. (2019)            A/B Testing & K-means                         KPI, PUR & ARPPU                    3
Baumann et al. (2018)         K-means                                       Silhouette value & Elbow method     6
Andrat and Ansari (2018)      DB-scan & K-means                             Dataset Ground Truth & DB Index     6
Kwon et al. (2018)            DB-scan & K-means                             Accuracy, recall, precision         3
Bernardi M.L. (2018)          PCA, REPT, FFA, J48                           recall, precision, ROC              5
Saas and Guitart (2017)       Trend Extraction & Agglomerative Clustering   Silhouette width & visualization    8
Fu et al. (2017)              Fuzzy C-means & PCA                           cluster stickiness distance         5
Drachen et al. (2017)         GMM, K-means, K-maxoid                        Silhouette score & RSS              4
Bernardi et al. (2017)        Feature Correlation & Time Series             Precision, recall, ROC & F-score    -
Manero et al. (2016)          PCA & k-means                                 no. clusters vs reliability plot    4
Zhang et al. (2016)           k-means                                       mean squared error function         6
Vallim et al. (2013)          DSM & M-DB-scan & eM-DBscan                   Elbow method                        3
Sifa et al. (2013)            SIVM                                          Dataset Ground Truth                7
Drachen et al. (2012)         SIVM, PCA & K-means                           Precision, recall & ROC             -

Table 1: Short summary of the literature review. For a detailed explanation, please refer to section 2.
After reviewing the above work, it can be inferred that a variety of machine learning
methodologies have been used so far in the field of Game Analytics to yield a wide
variety of outcomes. Some authors, like Odierna and Silveira (2019), Fu et al. (2017) and
Li et al. (2019), have used unsupervised learning methods like k-means, Fuzzy C-means and
DB-Scan clustering, whereas others, like Bernardi M.L. (2018), Bernardi et al. (2017) and
Yang et al. (2019), have combined supervised and unsupervised algorithms to improve the
results. All of the studies report successful results, which makes it harder for new
research in this field to achieve a clear improvement. However, this research is
motivated by the work of Baumann et al. (2018), who used the k-means algorithm and
suggested, as future work, using another clustering algorithm to obtain improved results.
3 Methodology
The use of different Machine Learning algorithms to find insightful patterns and gain a
sound understanding of the data and the domain has been common practice for a long time.
To standardize this process so that it can be followed and understood all around the
world, two methodologies have been established: Knowledge Discovery in Databases (KDD),
Fayyad et al. (1996), and the Cross-Industry Standard Process for Data Mining (CRISP-DM),
Wirth and Hipp (1995). In section 2 we saw that Bernardi M.L. (2018) advocated the use of
the KDD methodology to improve the efficiency of their output, so this research will
follow the KDD methodology.
Figure 1: KDD Methodology, Fayyad et al. (1996)
3.1 Data Collection
The data will be collected from the Steam Web API [1]. Steam is a video game distribution
platform with a userbase of millions whose data is openly accessible, so it serves as a
good source from which a good unbiased sample of the population can be extracted. The
data will be collected using the BeautifulSoup Python package, which crawls through the
Steam profiles and collects data on game usage. Another data set will be obtained from
the website in [2]. It is also openly available and is collected and maintained by Neill
et al. (2016). The data consists of more than 1 million rows. Its preprocessing will be
explained below.
3.2 Ethics
The above-mentioned data collection method does not compromise GDPR requirements. Steam
data is openly accessible and each profile is identified only by a 32-bit ID and a 64-bit
ID, so data columns like genre, playtime, game level and other statistics will be
available for analysis while no personal data is accessed or compromised.
[1] https://store.steampowered.com/stats/Steam-Game-and-Player-Statistics?l=english
[2] https://steam.internet.byu.edu/
3.3 Data Preprocessing
This is a very important part of the research and the most time-consuming one, due to the
size of the dataset mentioned in section 3.1. The data will be extracted in XML and JSON
format. It will then be parsed into CSV format using the XML and jsonlite packages in R.
This data will then be checked for missing values. According to Little and Rubin (2014)
there are three types of missing data: Missing Completely at Random (MCAR), Missing at
Random (MAR) and Missing Not at Random (MNAR). According to the authors, MAR and MCAR
data can be ignored and omitted, but the MNAR data depends on other features in the data
set and can be imputed using different methods. Missing values will be treated using the
R packages MICE and Amelia.
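The JSON-to-CSV step and the missing-value check can be sketched as follows. The proposal plans to use the XML and jsonlite packages in R; this is only a rough Python standard-library analogue, and the field names `steamid`, `playtime` and `level` are assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical raw records; a JSON null marks a missing value.
raw = (
    '[{"steamid": "1", "playtime": 120, "level": 7},'
    ' {"steamid": "2", "playtime": null, "level": 3}]'
)
records = json.loads(raw)
fields = ["steamid", "playtime", "level"]

# Write the records as CSV, leaving missing values as empty cells.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
for record in records:
    writer.writerow({k: "" if record.get(k) is None else record[k] for k in fields})
csv_text = buffer.getvalue()

# Count missing values per column before deciding how to treat them.
missing_counts = {k: sum(1 for r in records if r.get(k) is None) for k in fields}
```

The per-column counts give a first indication of where imputation or omission needs to be considered.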
3.3.1 Data Transformation
After cleaning the data, the dataset will be checked for possible noise or outlier
values. The correlation between features will be checked and descriptive statistics will
be evaluated and noted. This involves inspecting the numerical and integer features with
density plots and box plots, and checking the distribution of the data with histograms.
Since the dataset consists of many features of different data types, a dimensionality
reduction technique suited to mixed data, such as factor analysis of mixed data, will be
applied.
3.4 Data Mining
After completion of data cleansing and data transformation, the data set will be
partitioned into training and testing data. The partition split will be 80-20, that is,
80% for training and 20% for testing. Class imbalance of the various categorical features
will be checked and sampling methods will be applied if necessary. The modeling
techniques to be used are discussed in this section. As observed in section 2, multiple
supervised as well as unsupervised approaches have been used; clustering algorithms will
be used in this research for segmentation of gamer behaviour data.
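The 80-20 partition can be sketched in a few lines. The helper name `train_test_split` and the fixed seed are assumptions for illustration; the rows here are simply the integers 0 to 9.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows reproducibly and cut them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(10))
```

Fixing the seed keeps the partition reproducible across runs, which matters when several clustering algorithms are compared on the same split.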
3.4.1 K-Means Clustering Algorithm
K-means clustering is an unsupervised clustering algorithm that is used to segment
unlabelled data Shi et al. (2010). The algorithm first picks k data points at random as
the initial cluster centroids and measures the distance, typically the Euclidean
distance, from every other data point to each centroid. Each point is assigned to its
nearest centroid. The algorithm works towards two criteria: the intra-cluster distance,
that is the distance between data points within a cluster, should be as small as
possible, and the inter-cluster distance, that is the distance between data points of
different clusters, should be as large as possible. The assignment is iterated and the
cluster centroids are recomputed again and again until the assignments stop changing. The
number of clusters to be formed is either chosen by intuition from domain knowledge, that
is the ground truth, or determined with the evaluation metrics discussed in section 4.
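The iteration described above can be sketched in plain Python. This is a minimal illustration of the standard assign-and-update loop, not the implementation planned for the project; the sample points and the function name `kmeans` are assumptions.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Return (labels, centroids) using Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random data points as initial centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical two-feature data with two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centroids = kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other, whichever points are drawn as initial centroids.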
3.4.2 DB-Scan Clustering Algorithm
This algorithm has rarely been used in this domain and is a comparatively novel approach
to game behaviour profiling. Density-based spatial clustering of applications with noise,
abbreviated as DB-Scan, is an unsupervised clustering algorithm which forms clusters of
data points based on the density of points within a defined radius Ester et al. (1996).
A set minimum number of data points must lie within a given radius. This radius is known
as the Epsilon distance and the required density is set by the minimum number of points
specified. According to the authors, the algorithm needs only one input parameter to
determine the clusters. The advantage of using this algorithm is that it can find
clusters of arbitrary shape, whether the data is in two dimensions, three dimensions or a
higher-dimensional space, unlike k-means, which tends to form only spherical clusters.
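A compact sketch of the density-based procedure is given here. It follows the usual textbook formulation with an Epsilon radius (`eps`) and a minimum number of points (`min_pts`); the function name, parameter values and sample points are assumptions for illustration.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The isolated point at (50, 50) is labelled as noise instead of being forced into a cluster, which is a behaviour k-means does not offer.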
3.4.3 K-Medoids Clustering Algorithm
This algorithm tries to minimise the dissimilarity between the data points and the medoid
of their cluster, where a medoid is an actual data point chosen as the cluster centre
Arora et al. (2016). It segments the entire dataset into a group of clusters. Unlike
K-means, it can be used on a dataset consisting of numerical as well as categorical data.
The algorithm is also known as Partitioning Around Medoids (PAM).
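The sketch below illustrates the medoid idea on mixed data. It uses a simple alternating scheme (assign points, then re-pick each medoid), which is a simplification and not the full PAM swap procedure; the dissimilarity function, the (genre, hours) records and the naive initialisation are assumptions for illustration.

```python
def k_medoids(items, k, dissimilarity, max_iter=100):
    """Return (labels, medoid_items) using an alternating assign/update scheme."""
    medoids = list(range(k))  # naive initialisation: the first k items
    labels = []
    for _ in range(max_iter):
        # Assign each item to its least dissimilar medoid.
        labels = [
            min(range(k), key=lambda c: dissimilarity(x, items[medoids[c]]))
            for x in items
        ]
        new_medoids = []
        for c in range(k):
            members = [i for i, l in enumerate(labels) if l == c]
            # New medoid: the member with the smallest total dissimilarity
            # to all members of its cluster.
            best = min(
                members,
                key=lambda i: sum(dissimilarity(items[i], items[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, [items[m] for m in medoids]


def mixed_dissimilarity(a, b):
    # Categorical part: 0 if the genres match, 1 otherwise.
    # Numerical part: absolute difference in hours, scaled by 100.
    return (0 if a[0] == b[0] else 1) + abs(a[1] - b[1]) / 100


players = [("rpg", 10), ("fps", 200), ("rpg", 20),
           ("rpg", 15), ("fps", 220), ("fps", 210)]
labels, medoid_items = k_medoids(players, k=2, dissimilarity=mixed_dissimilarity)
```

Because only pairwise dissimilarities are needed, the categorical genre field can take part in the clustering without being converted into a mean.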
3.4.4 Gaussian Mixture Model
The Gaussian Mixture Model (GMM) is a probabilistic clustering model Drachen et al.
(2017). That means it calculates the probability of each data point belonging to each
cluster and fits the clusters so as to maximise the likelihood of the data. Unlike
k-means, which works best on circular or spherical distributions of data, GMM can form
clusters of different shapes in higher-dimensional space. Also, unlike k-means, which is
a hard clustering method that assigns each data point directly to one group, GMM gives
the probability of the data point being in each cluster and is thus more flexible.
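The soft assignment can be illustrated with a small one-dimensional, two-component mixture fitted by expectation-maximisation. This is a simplified sketch under stated assumptions (two components, initial means at the data minimum and maximum, a variance floor of 1e-6); the data values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: probability of each component for every data point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from those probabilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component
    return means, variances, weights, resp


data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, variances, weights, resp = fit_gmm_1d(data)
```

Each row of `resp` holds the membership probabilities of one data point, so a point lying between two groups would receive a split assignment where k-means would pick a single cluster.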
3.4.5 Fuzzy C-Means Clustering Algorithm
It is an unsupervised clustering algorithm which forms overlapping clusters Wang (2010).
A data point can belong to two or more clusters at the same time, with a degree of
membership to each cluster centre. This is also a soft clustering method and provides
flexibility. The algorithm calculates the distance between the data point and each
cluster centre: the smaller the distance, the stronger the membership of the data point
with that centre.
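The distance-based membership can be sketched with the commonly used Fuzzy C-means membership formula, u_j = 1 / sum_k (d_j / d_k)^(2/(m-1)), where m > 1 is the fuzzifier. The centres, the point and the choice m = 2 are assumptions for illustration, and only the membership step is shown, not the full iterative algorithm.

```python
import math


def fuzzy_memberships(point, centres, m=2.0):
    """Return the degree of membership of `point` to each cluster centre."""
    distances = [math.dist(point, c) for c in centres]
    if 0.0 in distances:
        # The point coincides with a centre: full membership to that centre.
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** power for d_k in distances) for d_j in distances
    ]


centres = [(0, 0), (4, 0)]
near_first = fuzzy_memberships((1, 0), centres)  # distances 1 and 3
midway = fuzzy_memberships((2, 0), centres)      # equal distances
```

The memberships of a point always sum to one; the point at distance 1 from the first centre and 3 from the second receives memberships 0.9 and 0.1, while the midway point is shared equally.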
4 Interpretation and Evaluation
Results obtained after applying multiple data mining algorithms must be interpretable in
order to gain knowledge from the patterns in the data. As mentioned by Fayyad et al.
(1996), model interpretation is very important from a business point of view, as the
results obtained should be easy for the stakeholder or decision-maker to understand;
important business decisions are based on them. As clustering is an unsupervised learning
technique, it deals with unlabeled data and segments or groups it based on variance or a
distance metric Maulik and Bandyopadhyay (2002). Unlike with supervised algorithms, there
is no definitive way of checking model accuracy. To determine the quality of the
clustering results, one needs domain knowledge to get an intuition of how many clusters
can be formed, and other properties like cluster consistency and cluster density are
checked using different metrics. These evaluation metrics are discussed in this section.
4.1 Davies-Bouldin Index
It is a clustering evaluation metric which determines the validity of the number of
clusters. It is given by the formula below. Andrat and Ansari (2018) have used this
metric for evaluation of the clusters in their study. For each pair of clusters it takes
the ratio of the sum of the spreads of data points inside the two clusters to the
distance between the two clusters Maulik and Bandyopadhyay (2002). A lower index value
indicates a more appropriate number of clusters [3].
$$DB \equiv \frac{1}{N} \sum_{i=1}^{N} D_i \tag{1}$$
where,
$N$ ≡ Number of Clusters
$D_i$ ≡ Strength of clustering scheme
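As a worked illustration of equation (1), the sketch below computes the index for a toy data set, taking $D_i$ as the largest ratio of summed within-cluster spread to centroid distance over all other clusters, which is the usual definition of the Davies-Bouldin index. The function name and the sample points are assumptions.

```python
import math


def davies_bouldin(points, labels):
    """Average, over clusters, of the worst spread-to-separation ratio."""
    clusters = sorted(set(labels))
    centroids, spreads = {}, {}
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        # Spread: mean distance of the members to their centroid.
        spreads[c] = sum(math.dist(p, centroids[c]) for p in members) / len(members)
    total = 0.0
    for i in clusters:
        # D_i: largest (spread_i + spread_j) / centroid distance over j != i.
        total += max(
            (spreads[i] + spreads[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = davies_bouldin(points, [0, 0, 1, 1])  # compact, well separated clusters
bad = davies_bouldin(points, [0, 1, 0, 1])   # stretched, overlapping clusters
```

The compact grouping gives an index of 0.2 and the stretched grouping gives 5.0, in line with the rule that a lower value indicates a better clustering.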
4.2 Silhouette Score
The Silhouette score checks how similar the data points within a given cluster are to
each other compared with the points of other clusters. It is generally calculated using a
distance metric and gives a score between -1 and +1. A score closer to +1 means that the
data points in a cluster are well matched, and hence that the number of clusters formed
is appropriate [4]. Authors like Baumann et al. (2018), Drachen et al. (2017) and Saas
and Guitart (2017) have advocated the use of the Silhouette score for evaluating the
number of clusters.
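$$s(i) \equiv \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \quad \text{if } |C_i| > 1 \tag{2}$$
where,
$a(i)$ ≡ mean dissimilarity (distance) between point $i$ and the other points of its own cluster $C_i$
$b(i)$ ≡ smallest mean dissimilarity between point $i$ and the points of any other cluster
$s(i)$ ≡ Silhouette score of point $i$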
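Equation (2) can be evaluated directly on a small example. The sketch below averages $s(i)$ over all points, with $a(i)$ taken as the mean distance to the other points of the same cluster and $b(i)$ as the smallest mean distance to the points of another cluster; singleton clusters are given a score of 0, which is a common convention and an assumption here, as are the sample points.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette value over all points, using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention for clusters with a single point
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0,), (1,), (10,), (11,)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

The natural grouping scores close to +1 (about 0.90), whereas the mixed-up grouping scores -0.45.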
4.3 Sum of Squared Errors
$$SSE \equiv \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{3}$$
where,
$x_i$ ≡ $i$-th observation in the group
$\bar{x}$ ≡ mean of the group
It is an evaluation metric used to determine the appropriate number of clusters for a
given data set. It sums the squared differences between each data point and the mean of
its cluster [5]. In the extreme case where each point in the dataset forms its own
cluster, the SSE is zero. The clustering algorithm is repeated for different numbers of
clusters and a graph of SSE against the number of clusters is plotted. From the graph it
can be seen that the SSE drops sharply up to a certain point and only decreases slightly
after it. This point is used to determine the appropriate number of clusters.
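The shape of the SSE curve can be reproduced on a toy example. To keep the sketch short, the cluster labels for each number of clusters are written by hand instead of being produced by a clustering run; the points and labelings are assumptions for illustration.

```python
def sse(points, labels):
    """Sum of squared differences between each point and its cluster mean."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(v) / len(members) for v in zip(*members)]
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, centroid))
    return total


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
curve = {
    1: sse(points, [0, 0, 0, 0]),
    2: sse(points, [0, 0, 1, 1]),
    3: sse(points, [0, 0, 1, 2]),
    4: sse(points, [0, 1, 2, 3]),  # every point is its own cluster
}
```

The values fall from 104 to 4 when going from one to two clusters and then only to 2 and 0, so the bend of the curve sits at two clusters.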
[3] https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index
[4] https://en.wikipedia.org/wiki/Silhouette_(clustering)
[5] https://hlab.stanford.edu/brian/error_sum_of_squares.html
Figure 2: No. of Clusters Identification, Thinsungnoen et al. (2015)
4.4 Dunn Index
This evaluation metric is similar to the Davies-Bouldin Index, but it evaluates the
clusters by comparing the smallest distance between two different clusters with the
largest distance within a single cluster Arbelaitz et al. (2013). A high value of the
Dunn index suggests a better quality of clusters [6]. Below is the mathematical equation
of the Dunn Index.
$$D \equiv \frac{\min_{1 \le i < j \le n} d(i, j)}{\max_{1 \le k \le n} d'(k)} \tag{4}$$
where,
$i$, $j$, $k$ are the cluster indices
$d$ ≡ inter cluster distance
$d'$ ≡ intra cluster distance
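Equation (4) admits several choices for the two distances. The sketch below assumes one common variant: the inter-cluster distance is the smallest distance between points of two different clusters and the intra-cluster distance is the cluster diameter (largest distance between two of its points). It also assumes that at least one cluster has two or more distinct points; the sample points are hypothetical.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    # Numerator: closest pair of points taken from two different clusters.
    min_inter = min(
        math.dist(p, q)
        for a, b in combinations(clusters.values(), 2)
        for p in a for q in b
    )
    # Denominator: largest distance between two points of the same cluster.
    max_intra = max(
        math.dist(p, q)
        for members in clusters.values()
        for p, q in combinations(members, 2)
    )
    return min_inter / max_intra


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = dunn_index(points, [0, 0, 1, 1])
bad = dunn_index(points, [0, 1, 0, 1])
```

The well separated grouping yields 5.0 and the mixed-up grouping 0.2, matching the rule that a higher Dunn index indicates better clusters.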
[6] https://medium.com/@ODSC/assessment-metrics-for-clustering-algorithms-4a902e00d92d
5 Plan of Execution
The flow diagram below gives an idea of how the technical part of the project will be
executed.
Figure 3: Process Flow Diagram
The Gantt chart below gives an overview of how the thesis project will be carried out
in the next semester.
Figure 4: Project Plan
6 Summary
After reviewing the various studies conducted in the gamer behaviour profiling domain, it
is clear that multiple studies explain gamers' behaviour in general, but very little
research has been conducted on understanding the motives and behaviour of hardcore
gamers. A methodology has been proposed with a detailed explanation of the modeling
techniques and evaluation metrics. Since the game analytics domain is so vast, the area
of E-sports match prediction also caught my attention, but due to the very little
research done in the field and the lack of time in this semester, that area could not be
explored. For future researchers, this is a promising area for applying machine learning
to predict results. Apart from that, a proper plan of action has been devised for the
timely execution of the thesis.
References
Ahmad, N. B., Barakji, S. A. R., Shahada, T. M. A. and Anabtawi, Z. A. (2017). How
to launch a successful video game: A framework, Entertainment Computing 23: 1–11.
URL: http://dx.doi.org/10.1016/j.entcom.2017.08.001
Andrat, H. and Ansari, N. (2018). Analyzing game stickiness using clustering techniques,
Advances in Intelligent Systems and Computing 554: 645–654.
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M. and Perona, I. (2013). An
extensive comparative study of cluster validity indices, Pattern Recognition 46(1): 243–
256.
Arora, P., Deepali and Varshney, S. (2016). Analysis of K-Means and K-Medoids Al-
gorithm for Big Data, Computer Science Procedia 78(December 2016): 507–512.
URL: http://dx.doi.org/10.1016/j.procs.2016.02.095
B, I. M., Savostyanov, D. and Litvyakov, B. (2018). Predicting Winning Team and
Probabilistic Ratings in Dota 2 and Counter-Strike : Global Offensive Video Games,
pp. 183–196.
Bartle, R. (1996). Hearts, Clubs, Diamonds, Spades: who suit MUDs., (August).
URL: https://mud.co.uk/richard/hcds.htm
Bauckhage, C., Kersting, K., Sifa, R., Thurau, C., Drachen, A. and Canossa, A. (2012).
How Players Lose Interest in Playing a Game : An Empirical Study Based on Distri-
butions of Total Playing Times, pp. 139–146.
Baumann, F., Emmert, D., Baumgartl, H. and Buettner, R. (2018). Hardcore Gamer
Profiling: Results from an unsupervised learning approach to playing behavior on the
Steam platform, Procedia Computer Science 126: 1289–1297.
URL: https://doi.org/10.1016/j.procs.2018.08.078
Bernardi, M. L., Cimitile, M., Martinelli, F. and Mercaldo, F. (2017). A time series
classification approach to game bot detection, pp. 1–11.
Bernardi M.L., Cimitile M., M. F. M. F. (2018). A Machine Learning Approach for
Game Bot Detection Through Behavioural Features, Vol. 868, Springer International
Publishing.
URL: https://link.springer.com/chapter/10.1007/978-3-319-93641-36citeas
Choi, H. S., Ko, M. S., Medlin, D. and Chen, C. (2018). The effect of intrinsic and
extrinsic quality cues of digital video games on sales: An empirical investigation, De-
cision Support Systems 106: 86–96.
URL: https://doi.org/10.1016/j.dss.2017.12.005
Drachen, A., Green, J., Gray, C., Harik, E., Lu, P., Sifa, R. and Klabjan, D. (2017).
Guns and guardians: Comparative cluster analysis and behavioral profiling in destiny,
IEEE Conference on Computational Intelligence and Games, CIG.
Drachen, A., Sifa, R., Bauckhage, C. and Thurau, C. (2012). Guns, swords and data:
Clustering of player behavior in computer games in the wild, 2012 IEEE Conference
on Computational Intelligence and Games, CIG 2012 pp. 163–170.
Ester, M., Kriegel, H.-P., Jorg, S. and Xu, X. (1996). A Density-Based Clustering Al-
gorithms for Discovering Clusters, KDD-96 Proceedings 96(34): 226–231.
Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996). From data mining to knowledge
discovery in databases, AI Magazine 17(3): 37–53.
Fernandes, L. V., Castanho, C. D. and Jacobi, R. P. (2019). A Survey on Game Analytics
in Massive Multiplayer Online Games, Brazilian Symposium on Games and Digital
Entertainment, SBGAMES 2018-November: 21–30.
Fu, X., Chen, X., Shi, Y. T., Bose, I. and Cai, S. (2017). User segmentation for retention
management in online social games, Decision Support Systems 101: 51–68.
URL: http://dx.doi.org/10.1016/j.dss.2017.05.015
Kwon, H., Jeong, W., Kim, D. W. and Yang, S. I. (2018). Clustering Player Behavioral
Data and Improving Performance of Churn Prediction from Mobile Game, 9th Interna-
tional Conference on Information and Communication Technology Convergence: ICT
Convergence Powered by Smart Intelligence, ICTC 2018 pp. 1252–1254.
Li, X., Lu, C., Peltonen, J. and Zhang, Z. (2019). A statistical analysis of Steam user
profiles towards personalized gamification, pp. 217–228.
Little, R. J. A. and Rubin, D. B. (2014). Statistical Analysis With Missing Data, (1): 312–
348.
Manero, B., Torrente, J., Freire, M. and Fernández-Manjón, B. (2016). An instrument
to build a gamer clustering framework according to gaming preferences and habits,
Computers in Human Behavior 62: 353–363.
Maulik, U. and Bandyopadhyay, S. (2002). Performance evaluation of some clustering
algorithms and validity indices, IEEE Transactions on Pattern Analysis and Machine
Intelligence 24(12): 1650–1654.
Neill, M. O., Vaziripour, E., Wu, J. and Zappala, D. (2016). Condensing Steam : Distilling
the Diversity of Gamer Behavior, pp. 81–95.
Odierna, B. A. and Silveira, I. F. (2019). MMORPG Player Classification Using Game
Data Mining and K-means, Springer International Publishing.
URL: http://dx.doi.org/10.1007/978-3-030-12388-8_40
Rejer, I. and Twardochleb, M. (2018). Gamers’ involvement detection from EEG
data with cGAAM: A method for feature selection for clustering, Expert Systems with
Applications 101: 196–204.
URL: https://doi.org/10.1016/j.eswa.2018.01.046
Saas, A. and Guitart, A. (2017). Discovering Playing Patterns : Time Series Clustering
of Free-To-Play Game Data.
Shi, N., Liu, X. and Guan, Y. (2010). Research on k-means clustering algorithm: An
improved k-means clustering algorithm, 3rd International Symposium on Intelligent
Information Technology and Security Informatics, IITSI 2010 pp. 63–67.
Sifa, R., Drachen, A., Bauckhage, C., Thurau, C. and Canossa, A. (2013). Behavior
evolution in Tomb Raider Underworld, IEEE Conference on Computational Intelligence
and Games, CIG pp. 1–8.
Thinsungnoen, T., Kaoungku, N., Durongdumronchai, P., Kerdprasop, K. and Kerdpra-
sop, N. (2015). The Clustering Validity with Silhouette and Sum of Squared Errors,
pp. 44–51.
Vallim, R. M. M., Filho, J. A. A., Mello, R. F. D. and Carvalho, A. C. P. L. F. D.
(2013). Online behavior change detection in computer games, Expert Systems with
Applications 40: 6258–6265.
Wang, Z. (2010). Comparison of four kinds of fuzzy C-means clustering methods, Proceed-
ings - 3rd International Symposium on Information Processing, ISIP 2010 pp. 563–566.
Wirth, R. and Hipp, J. (1995). CRISP-DM : Towards a Standard Process Model for Data
Mining, Proceedings of the Fourth International Conference on the Practical Applica-
tion of Knowledge Discovery and Data Mining (24959): 29–39.
Yang, W., Yang, G., Huang, T., Chen, L. and Liu, Y. E. (2019). Whales, Dolphins,
or Minnows? Towards the Player Clustering in Free Online Games Based on Pur-
chasing Behavior via Data Mining Technique, Proceedings - 2018 IEEE International
Conference on Big Data, Big Data 2018 pp. 4101–4108.
Zhang, L., LU, F. L., LIU, A., GUO, P. and LIU, C. (2016). Application of K-Means Clus-
tering Algorithm for Classification of NBA Guards, International Journal of Science
and Engineering Applications 5(1): 1–6.