Subscription fraud analytics using classification


SOMDEEP KUMAR SEN
Trimax Analytics and Optimization Services
2/21/2014

Subscription Fraud Analytics Using Naïve Bayes Classifier

Contents
Introduction
Overview of the study
Objective of the Study
Telecommunication fraud: an Overview
  Definition
  Types
    Subscription fraud
    Recharge Voucher Fraud
    Pre-paid Balance Fraud
    Unauthorized Service Fraud
Models Used
  Naïve Bayes Classification: an overview
  Decision Tree (A Supervised Learning Method)
Methodology
Analysis & Findings
  Using R
  Using RapidMiner
Conclusion

Introduction
The advancement of technological tools such as computers, the internet, and cellular phones
has made life easier and more convenient for most people in our society. However, some
individuals and groups have subverted these telecommunication devices into tools to defraud
numerous unsuspecting victims. It is not uncommon for a scam to originate in a city, county,
state, or even a country different from that in which the victim resides. While telecom fraud
may occur in different forms, the present study focuses on the use of analytics to detect
subscription fraud, specifically on applying the Naïve Bayes classification algorithm
to detect and predict probable fraudsters.

Overview of the study
A fictitious telecom company called Bad Idea came up with a strange rate plan called the Praxis
Plan, under which callers are allowed to make only one call in each of four time slots: Morning
(9AM-Noon), Afternoon (Noon-4PM), Evening (4PM-9PM) and Night (9PM-Midnight), i.e. four calls
per day. Despite the popularity of the plan, Bad Idea was the target of subscription fraud by a
gang of three fraudsters: Sally, Virginia and Vince. Their services were eventually terminated.
Bad Idea has their call logs spanning one and a half months.
The analytics team of the company has been provided with two data sets: Black-List Subscriber
Call-Logs and Audit Log. The Black-List Subscriber Call-Logs data set contains the calling
patterns of the three fraudsters, i.e. Sally, Virginia and Vince. Every 5 days the company
undertakes an audit to see whether these fraudsters have rejoined its network. The company
reviews the list of subscribers who have made calls to the same people as these three fraudsters
in the same time frames. This list is provided in the Audit Log.
Test Data: http://bit.ly/1du9cRs
Training Data: http://bit.ly/1du9AQ1


Objective of the Study


• To provide the names of the probable callers and the confidence in terms of probability
• To provide the name of the fraudster, if any
• To provide the code used to determine the subscriber

Telecommunication fraud: an Overview
Definition
Telecommunication fraud is the theft of telecommunication service (telephones, cell phones,
computers etc.) or the use of telecommunication service to commit other forms of fraud.
Victims include consumers, businesses and communication service providers.
Types
Subscription fraud
Subscription fraud occurs when someone signs up for service using fraudulently obtained
customer information or false identification. Lawbreakers obtain a person's personal
information and use it to set up a cell phone account in that person's name.
Recharge Voucher Fraud
This mainly involves unusual top-up recharges and a high number of recharges in a given time
period.
Pre-paid Balance Fraud
Employees making a high number of manual balance changes, as well as subscribers with unusually
high balances, might indicate Pre-paid Balance Fraud.
Unauthorized Service Fraud
A profile mismatch found when reconciling HLR (Home Location Register) records against post-paid
subscriber profiles, or HLR services against post-paid subscriber services, or a sudden change
in subscriber usage, could indicate Unauthorized Service Fraud.


Models Used
Naïve Bayes Classification: an overview
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from
Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for
the underlying probability model would be "independent feature model".
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular
feature of a class is unrelated to the presence (or absence) of any other feature. For example, a
fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these
features depend on each other or upon the existence of the other features, a naive Bayes
classifier considers all of these properties to independently contribute to the probability that
this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained
very efficiently in a supervised learning setting. In many practical applications, parameter
estimation for naive Bayes models uses the method of maximum likelihood; in other words,
one can work with the naive Bayes model without believing in Bayesian probability or using any
Bayesian methods.
An advantage of the naive Bayes classifier is that it requires a small amount of training data to
estimate the parameters (means and variances of the variables) necessary for classification.
Because independent variables are assumed, only the variances of the variables for each class
need to be determined and not the entire covariance matrix.
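To make the independence assumption concrete for categorical data like call logs, here is a minimal Python sketch. It is not the R code used in this study; the rows, callee names and function names are all hypothetical and chosen only for illustration. It multiplies the class prior by the maximum-likelihood estimate of each feature value given the class, then normalises.

```python
from collections import Counter, defaultdict

# Hypothetical toy call log: (morning callee, evening callee) -> caller.
# These rows are illustrative only; they are NOT the study's data.
ROWS = [
    (("Kelly", "Frank"), "Vince"),
    (("Larry", "Frank"), "Vince"),
    (("Kelly", "Emma"), "Sally"),
    (("Kelly", "Emma"), "Sally"),
    (("Robert", "Emma"), "Sally"),
    (("Larry", "Emma"), "Virginia"),
    (("Kelly", "Emma"), "Virginia"),
]


def train(rows):
    """Count class frequencies and per-feature value frequencies per class."""
    class_counts = Counter(label for _, label in rows)
    # value_counts[label][feature_index][value] -> count
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in rows:
        for i, value in enumerate(features):
            value_counts[label][i][value] += 1
    return class_counts, value_counts


def posterior(features, class_counts, value_counts):
    """Return P(class | features) under the naive independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(features):
            # maximum-likelihood estimate of P(value | class)
            score *= value_counts[label][i][value] / n
        scores[label] = score
    norm = sum(scores.values())  # zero if no class has seen every value
    return {label: s / norm for label, s in scores.items()}


class_counts, value_counts = train(ROWS)
probs = posterior(("Kelly", "Emma"), class_counts, value_counts)
```

In this toy data Vince never calls Emma in the evening, so his product collapses to zero regardless of the other feature. With unsmoothed estimates, a single unseen value rules a class out entirely.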
Decision Tree (A Supervised Learning Method):
A decision tree is a flowchart-like structure in which each internal node represents a test on
an attribute, each branch represents an outcome of that test, and each leaf node represents a
class label (the decision taken after evaluating the attributes along the path). The paths from
root to leaf represent classification rules. In decision analysis, a decision tree and the
closely related influence diagram are used as visual and analytical decision support tools, in
which the expected values (or expected utilities) of competing alternatives are calculated.
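The structure described above can be sketched in a few lines of Python. The tree below is hand-written and hypothetical (attribute names, callee names and labels are assumptions); it does not reproduce the tree fitted in this study. It only shows how internal nodes, branches, leaves and root-to-leaf rules relate.

```python
# Hypothetical hand-built tree; NOT the tree fitted in the study.
# An internal node is a dict naming the attribute it tests; a leaf is a label.
TREE = {
    "attribute": "evening",
    "branches": {
        "Frank": {
            "attribute": "night",
            "branches": {"Clark": "Vince", "Emma": "Sally"},
        },
        "Emma": "Virginia",
    },
}


def classify(node, record):
    """Follow branches from the root until a leaf (class label) is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


def rules(node, path=()):
    """List every root-to-leaf path as a (conditions, label) rule."""
    if not isinstance(node, dict):
        return [(path, node)]
    out = []
    for value, child in node["branches"].items():
        out.extend(rules(child, path + ((node["attribute"], value),)))
    return out


label = classify(TREE, {"evening": "Frank", "night": "Clark"})
all_rules = rules(TREE)
```

Each entry of `all_rules` pairs a list of (attribute, value) conditions with the class label at the end of that path, which is the "classification rule" reading of the tree.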

Methodology
To make the final prediction, Naïve Bayes classification was carried out in two different
packages, R and RapidMiner, so that the results from the two packages could be compared.

Analysis & Findings
Using R
Our training data (BlackListSubscriberCallLogs), in the form of an Excel sheet, has 138 instances
of the names of the people called by the fraudsters Sally, Vince and Virginia in each of the time
frames. We import this dataset into R as “blacklisted”.
We also have a file (Audit Log) of 15 instances, for which we predict the fraudster by the end of
this report. This is our unseen data. We import this dataset into R as “audit”.
The Process:
• Import the datasets and understand them
• Install packages and load the libraries “caret” and “klaR” for Naïve Bayes and “party” for
Decision Tree
• Train our model (Naïve Bayes) using 10-fold cross validation
• Tweak the parameters of the model to obtain finer results
• Check for Accuracy and Kappa values
• Compare the result of the Naïve Bayes model with a 10-fold cross-validated Decision Tree model
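Of the two measures checked in the steps above, Kappa is the less familiar: it compares the observed accuracy with the agreement expected by chance from the row and column totals of the confusion matrix. The Python sketch below (not the study's R code) computes both from a hypothetical 3-class confusion matrix; the counts are invented for illustration and are not the study's results.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j] = number of cases of actual class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical confusion matrix (NOT taken from the study).
matrix = [
    [5, 1, 0],
    [1, 4, 0],
    [0, 1, 3],
]
acc, kappa = accuracy_and_kappa(matrix)
```

A Kappa near 0 means the classifier does no better than chance agreement, while a value near 1 means near-perfect agreement, so it is a useful companion to raw accuracy when classes are unevenly represented.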



The above method is used for 10-fold cross validation, which divides the entire dataset into 10
parts of roughly equal size and uses 9 parts for training and 1 part for testing. It repeats
this 10 times, so that each part serves as the test set exactly once. The reported outcome of
the model reflects all 10 trials.
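The fold mechanics can be sketched as follows in Python (this is not the caret call used in the study; the function name and the seed value are assumptions). The row indices are shuffled once, cut into 10 folds, and each fold is held out in turn.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices once, then yield (train, test) index lists per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seed value is an arbitrary assumption
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 138 rows, matching the size of the training data described in this report.
splits = list(k_fold_indices(138, k=10))
```

With 138 rows, the folds hold 13 or 14 rows each, and every row is tested exactly once across the 10 rounds.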
Now we shuffle our dataset (blacklisted) for random sampling and take 15 observations
(about 10%) on which to apply our model and check its accuracy. Calling the set.seed() function
beforehand makes this sample reproducible, so the same set of observations can be drawn again.
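The role of the seed can be illustrated with a short Python sketch (not the study's R code; the function name and the seed value 42 are assumptions). The report does not state whether the 15 sampled rows were excluded from training; this sketch excludes them.

```python
import random


def holdout_sample(n_rows, n_test, seed):
    """Draw n_test row indices at random; the same seed gives the same draw."""
    rng = random.Random(seed)
    test = sorted(rng.sample(range(n_rows), n_test))
    chosen = set(test)
    train = [i for i in range(n_rows) if i not in chosen]
    return train, test


# 15 of 138 rows, as in the text; the seed value is hypothetical.
train_idx, test_idx = holdout_sample(138, 15, seed=42)
```

Re-running with the same seed returns the same 15 indices, which is what makes the accuracy check repeatable.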



Upon analyzing the confusion matrix, we find that:
• The accuracy of our model is (6+3+2)/15 = 73.3%
• Precision of predicting Sally = 6/9 = 66.67%
• Precision of predicting Vince = 3/3 = 100%
• Precision of predicting Virginia = 2/3 = 66.67%
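The arithmetic above can be checked with a few lines of Python. The only inputs are the counts quoted in the list: the number of correct predictions per class and the number of times each class was predicted. The off-diagonal cells of the confusion matrix are not given in the text and are not needed here.

```python
# Counts read from the figures quoted above.
correct = {"Sally": 6, "Vince": 3, "Virginia": 2}    # diagonal of the matrix
predicted = {"Sally": 9, "Vince": 3, "Virginia": 3}  # times each name was predicted

# Accuracy: all correct predictions over all predictions.
accuracy = sum(correct.values()) / sum(predicted.values())

# Precision per class: correct predictions of a name over all predictions of it.
precision = {name: correct[name] / predicted[name] for name in correct}
```

Precision answers "when the model names this person, how often is it right?", which is why its denominator is the predicted total rather than the actual total.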
Now we tweak our model using the Laplace correction (fL) and usekernel parameters. Laplace
correction is a smoothing technique that assigns non-zero probability to events that do not
occur in the sample. The usekernel option switches on kernel density estimation, a
non-parametric way to estimate the probability density function of a random variable.
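As a rough illustration of the Laplace correction, the Python sketch below uses the standard additive form: add fL to every count and fL times the number of possible values to the class total. Whether klaR applies exactly this formula is an assumption here; the function name and the example counts are hypothetical.

```python
def smoothed_probability(count, class_total, n_values, fL=1.0):
    """Additive (Laplace) estimate of P(value | class) for a categorical feature.

    count       -- times the value was seen with this class
    class_total -- number of training rows of this class
    n_values    -- number of distinct values the feature can take
    fL          -- smoothing factor; 0 gives the unsmoothed estimate
    """
    return (count + fL) / (class_total + fL * n_values)


# Hypothetical case: a callee never seen for this class, 10 rows, 5 possible callees.
unsmoothed = smoothed_probability(0, 10, 5, fL=0.0)
smoothed = smoothed_probability(0, 10, 5, fL=1.0)
```

Without smoothing the unseen value gets probability 0 and removes the class from consideration; with fL = 1 it gets a small non-zero share instead.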



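We observe that the outcomes of the two models, fit and fit1 (with Laplace and usekernel), are
identical in this case. For usekernel this is unsurprising, since kernel density estimation
applies to numeric predictors and all predictors here are categorical.
Thus, with 73% accuracy, we apply our model to the unseen data (audit).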

To obtain the posterior probabilities for each observation in the unseen data, we type the
following command:



Now we build the Decision Tree model:

> plot (ctreeFit$finalModel)



Reading the graph: if the evening call is made to Frank, the night call to Clark, and the
morning call to Kelly, Larry or Robert, the probability that the caller is Vince is close to
80%.
A disadvantage of the Decision Tree plot is that Sally's name appears below each of the nodes,
irrespective of the correct name.
We also find that the accuracy of Naïve Bayes (0.661) is better than that of the Decision Tree
(0.578). We therefore now compare the Naïve Bayes output obtained from RapidMiner against the
output of R.


Using RapidMiner
Initially, both data sets are uploaded into RapidMiner. The Black-List Subscriber Call-Logs data
set is named telecom234 and the Audit Log is named telecom56. Of the many classification
algorithms available in RapidMiner, we choose the Naïve Bayes classifier. The telecom234 data
set is dragged into the main window.
In data mining we use the concept of data splitting: we divide our data set into two parts, a
training set and a validation set. The training set is used to create the model, whereas the
validation set is used to estimate the accuracy of the created model. To create a model and
estimate its accuracy using data splitting, we use the validation operators found in the
Evaluation -> Validation folder of the operator window. The most commonly used operators are
Split Validation and X-Validation. We first make use of Split Validation: drag and drop the
Split Validation operator into the process window. Split Validation is a group operator, i.e.,
it groups multiple operators inside it. Group operators carry a special sign, two overlapping
blue squares on their icon, as shown in the figure below.



The Validation operator has a split ratio parameter (visible in the parameter window on the
right), which specifies how the data set will be split. The value 0.7 in the figure above puts
70% of the data in the training set and the remaining 30% in the testing set. Because Split
Validation is a nested operator, we now double-click it to open it. The Validation subprocess
window has two parts: Training and Testing.
Now the Naïve Bayes classifier is placed in the Training window. Validation allows us to
estimate the accuracy of our model. For this purpose, RapidMiner provides many performance
operators in the Performance Measurement folder. Place the Apply Model and Performance
(Classification) operators in the Testing window, as shown in the figure below.



Now the averagable ports of the Validation operator are connected to the result port of the
process window. In the Results perspective, one gets a performance vector with details about
the performance of the created model. For example, the created model has an accuracy of almost
71%, as shown in the figure below, which is quite good.


Once the model is created, it is time to use it for classification/prediction.
The telecom56 data set, which is the unlabeled data set, is dragged into the main window. An
Apply Model operator is also dragged into the main window. The Apply Model operator takes the
model from the Validation operator and applies it to the unlabeled input, i.e. the telecom56
data, as shown in the figure below.


Running the whole process now provides the prediction, as shown in the figure below, in the
form of the names of the probable callers along with the confidence in terms of probability.


Conclusion
Comparing the accuracy and precision from the confusion matrix of Rapid Miner and R results,
we see:
For R:
• The accuracy of our model is (6+3+2)/15 = 73.3%
• Precision of predicting Sally = 6/9 = 66.67%
• Precision of predicting Vince = 3/3 = 100%
• Precision of predicting Virginia = 2/3 = 66.67%

For RapidMiner:
• The accuracy of our model is 70.73%
• Precision of predicting Sally = 81.82%
• Precision of predicting Vince = 58.33%
• Precision of predicting Virginia = 72.22%

Since both statistical packages give an accuracy above 70%, we can be reasonably confident in
our model and conclude that Naïve Bayes may be considered the better of the classifiers tried
in this case, where the training data is small and categorical.
