SUBSCRIPTION FRAUD ANALYTICS USING CLASSIFICATION

SOMDEEP KUMAR SEN
Trimax Analytics and Optimization Services
2/21/2014
Subscription Fraud Analytics Using Naïve Bayes Classifier

Contents
Introduction
Overview of the study
Objective of the Study
Telecommunication fraud: an Overview
  Definition
  Types
    Subscription fraud
    Recharge Voucher Fraud
    Pre-paid Balance Fraud
    Unauthorized Service Fraud
Models Used
  Naïve Bayes Classification: an overview
  Decision Tree (A Supervised Learning Method)
Methodology
Analysis & Findings
  Using R
  Using RapidMiner
Conclusion

Introduction
The advancement of technological tools such as computers, the internet, and cellular phones
has made life easier and more convenient for most people in our society. However, some
individuals and groups have subverted these telecommunication devices into tools to defraud
numerous unsuspecting victims. It is not uncommon for a scam to originate in a city, county,
state, or even a country different from that in which the victim resides. While telecom fraud
may occur in different forms, the present study focuses on the use of analytics to detect
subscription fraud, and specifically on the application of the Naïve Bayes classification
algorithm to detect and predict probable fraudsters.

Overview of the study
A fictitious telecom company called Bad Idea came up with a strange rate plan called the Praxis
Plan, under which callers are allowed to make only one call in each of the Morning (9AM-Noon),
Afternoon (Noon-4PM), Evening (4PM-9PM) and Night (9PM-Midnight) slots, i.e. four calls per day.
Despite the popularity of the plan, Bad Idea was a target of Subscription Fraud by a gang of
fraudsters consisting of three people: Sally, Virginia and Vince. Their services were eventually
terminated. Bad Idea has their call logs spanning over one and a half months.
The analytics team of the company has been provided with two data sets: Black-List Subscriber
Call-Logs and Audit Log. The Black-List Subscriber Call-Logs data set includes the calling
patterns of the three fraudsters, i.e. Sally, Virginia and Vince. Every 5 days the company
undertakes an audit to see whether these fraudsters have rejoined its network. The company
reviews the list of subscribers who have made calls to the same people as these three fraudsters
and in the same time frames. This list has been provided in the Audit Log.
Test Data: http://bit.ly/1du9cRs
Training Data: http://bit.ly/1du9AQ1

Objective of the Study


• To provide the names of the probable callers and the confidence in terms of probability
• To provide the name of the fraudster, if any
• To provide the code used to determine the subscriber

Telecommunication fraud: an Overview
Definition
Telecommunication fraud is the theft of telecommunication service (telephones, cell phones,
computers etc.) or the use of telecommunication service to commit other forms of fraud.
Victims include consumers, businesses and communication service providers.
Types
Subscription fraud
Subscription fraud occurs when someone signs up for service with fraudulently obtained
customer information or false identification. Lawbreakers obtain a person's personal information
and use it to set up a cell phone account in that person's name.
Recharge Voucher Fraud
This mainly includes unusual top-up recharges and a high number of recharges in a given time period.
Pre-paid Balance Fraud
Employees with a high number of manual balance changes, as well as subscribers with high
balances, might be an indication of Pre-paid Balance Fraud.
Unauthorized Service Fraud
A mismatch found when reconciling HLR (Home Location Register) records against post-paid
subscriber profiles, or HLR services against post-paid subscriber services, or a sudden change
in subscriber usage, could be an indication of Unauthorized Service Fraud.

Models Used
Naïve Bayes Classification: an overview
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from
Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for
the underlying probability model would be "independent feature model".
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular
feature of a class is unrelated to the presence (or absence) of any other feature. For example, a
fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these
features depend on each other or upon the existence of the other features, a naive Bayes
classifier considers all of these properties to independently contribute to the probability that
this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained
very efficiently in a supervised learning setting. In many practical applications, parameter
estimation for naive Bayes models uses the method of maximum likelihood; in other words,
one can work with the naive Bayes model without believing in Bayesian probability or using any
Bayesian methods.
An advantage of the naive Bayes classifier is that it requires a small amount of training data to
estimate the parameters (means and variances of the variables) necessary for classification.
Because independent variables are assumed, only the variances of the variables for each class
need to be determined and not the entire covariance matrix.
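The independence assumption described above can be sketched in a few lines. This is a minimal Python illustration (the report itself worked in R and RapidMiner), and the toy call log, the function names `fit` and `posterior`, and the two-feature layout are all assumptions made for the example, not the report's data or code.

```python
from collections import Counter, defaultdict

# Hypothetical toy call log: (morning callee, evening callee) -> caller.
# These rows are illustrative only and are not taken from the report's data.
TRAIN = [
    (("Kelly", "Frank"), "Vince"),
    (("Larry", "Frank"), "Vince"),
    (("Kelly", "Clark"), "Sally"),
    (("Robert", "Clark"), "Sally"),
]

def fit(rows):
    """Count class frequencies and per-feature value frequencies."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for features, label in rows:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def posterior(features, class_counts, value_counts):
    """Return P(label | features) under the naive independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(label)
        for i, value in enumerate(features):
            # Maximum-likelihood estimate of P(value | label); each feature
            # contributes its own factor, independently of the others.
            score *= value_counts[(i, label)][value] / n
        scores[label] = score
    norm = sum(scores.values())
    return {label: (s / norm if norm else 0.0) for label, s in scores.items()}

class_counts, value_counts = fit(TRAIN)
probs = posterior(("Kelly", "Frank"), class_counts, value_counts)
print(probs)
```

Because the estimates here are raw relative frequencies, any callee never seen with a given caller drives that caller's score to zero.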
Decision Tree (A Supervised Learning Method):
A decision tree is a flowchart-like structure in which each internal node represents a test on
an attribute, each branch represents an outcome of the test, and each leaf node represents a
class label (the decision taken after computing all attributes). The paths from root to leaf
represent classification rules. In decision analysis, a decision tree and the closely related
influence diagram are used as visual and analytical decision support tools, in which the
expected values (or expected utilities) of competing alternatives are calculated.
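To make the node, branch and leaf vocabulary concrete, here is a small Python sketch of a hand-written tree and the root-to-leaf walk that classifies one record. The tree contents, the `TREE` layout and the `classify` helper are hypothetical; they are not the tree fitted in this study.

```python
# A tiny hand-written tree; attributes, branches and labels are hypothetical.
TREE = {
    "attribute": "evening",
    "branches": {
        "Frank": {
            "attribute": "night",
            "branches": {"Clark": "Vince", "Kelly": "Sally"},
        },
        "Larry": "Virginia",
    },
}

def classify(node, record, default=None):
    """Follow one root-to-leaf path; each step tests a single attribute."""
    while isinstance(node, dict):
        value = record.get(node["attribute"])
        if value not in node["branches"]:
            return default  # no branch exists for this attribute value
        node = node["branches"][value]
    return node  # a leaf holds the class label

label = classify(TREE, {"evening": "Frank", "night": "Clark"})
print(label)
```

Each path through the nested dictionaries corresponds to one classification rule, for example "evening = Frank and night = Clark implies Vince" in this made-up tree.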
Methodology
To make the final prediction, Naïve Bayes classification was carried out in two different
packages, R and RapidMiner, so that the results provided by the two could be compared.

Analysis & Findings
Using R
Our training data (BlackListSubscriberCallLogs), in the form of an Excel sheet, has 138 instances
of the names of the people called by the fraudsters Sally, Vince and Virginia in each of the time
frames. We import this dataset into R as “blacklisted”.
We also have a file (Audit Log) of 15 instances for which we predict the fraudster by the end of
this report. This is our unseen data. We import this dataset into R as “audit”.
The Process:
• Import the datasets and understand them
• Install the packages and load the libraries “caret” and “klaR” for Naïve Bayes and “party” for Decision Tree
• Train our model (Naïve Bayes) using 10-fold cross validation
• Tweak the parameters of the model to obtain finer results
• Check the Accuracy and Kappa values
• Compare the result of the Naïve Bayes model with a 10-fold cross-validated Decision Tree model


The training setup uses 10-fold cross validation, which divides the entire dataset into 10 parts
(folds). In each of 10 rounds, 9 folds are used for training and the remaining fold for testing,
so that every observation is used for testing exactly once. The outcome of the model is reported
after it has been trained and evaluated over all 10 rounds.
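The fold rotation can be sketched as follows. This is a stdlib Python illustration of plain k-fold splitting, not the caret call used in the study; the helper names `k_fold_indices` and `train_test_rounds` and the seed value are assumptions.

```python
import random

def k_fold_indices(n, k=10, seed=42):
    """Shuffle 0..n-1 once, then deal the indices into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_test_rounds(n, k=10, seed=42):
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

rounds = list(train_test_rounds(138))  # 138 rows, as in the training data
print(len(rounds), [len(test) for _, test in rounds])
```

A model would be fitted on each `train` list and scored on the matching `test` list, and the ten scores averaged.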
Now we shuffle our dataset (blacklisted) for random sampling and take 15 observations
(about 10%) on which to apply our model and check its accuracy. Fixing the random seed with
the set.seed() function beforehand makes this set of observations reproducible.


Upon analyzing the confusion matrix, we find that:
• The accuracy of our model is (6+3+2)/15 = 73.3%
• Precision of predicting Sally = 6/9 = 66.67%
• Precision of predicting Vince = 3/3 = 100%
• Precision of predicting Virginia = 2/3 = 66.67%
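The arithmetic in the list above can be reproduced directly from the counts it quotes. The Python sketch below uses only those counts (correct predictions and the number of times each name was predicted); the off-diagonal cells of the confusion matrix are not given in the text, so they are not reconstructed here.

```python
# Correct predictions (confusion-matrix diagonal) and the number of times
# each name was predicted, as quoted in the list above.
correct = {"Sally": 6, "Vince": 3, "Virginia": 2}
predicted = {"Sally": 9, "Vince": 3, "Virginia": 3}

# Accuracy: correct predictions over all predictions.
accuracy = sum(correct.values()) / sum(predicted.values())
# Precision per name: correct predictions of that name over all predictions of it.
precision = {name: correct[name] / predicted[name] for name in correct}

print(f"accuracy = {accuracy:.1%}")
for name, value in precision.items():
    print(f"precision({name}) = {value:.2%}")
```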
Now we tweak our model using the Laplace (fL) and usekernel parameters. Laplace is a smoothing
technique that assigns a non-zero probability to events that do not occur in a sample. The
usekernel option switches on kernel density estimation, which is a non-parametric way to
estimate the probability density function of a random variable.
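As a rough illustration of what the Laplace correction does to a categorical estimate, the Python sketch below uses the standard additive-smoothing form. The function name `smoothed_probability` and the numbers passed to it are hypothetical, and the exact computation inside the R package may differ in detail.

```python
def smoothed_probability(count, class_total, n_values, fL=1.0):
    """Laplace estimate of P(value | class) for a categorical feature.

    count       -- times the value was seen with this class
    class_total -- number of training rows of this class
    n_values    -- number of distinct values the feature can take
    fL          -- smoothing constant (0 gives the raw relative frequency)
    """
    return (count + fL) / (class_total + fL * n_values)

# Hypothetical numbers: a callee never seen for a caller with 46 logged
# calls, where the feature has 10 possible callees.
unsmoothed = smoothed_probability(0, 46, 10, fL=0.0)
smoothed = smoothed_probability(0, 46, 10, fL=1.0)
print(unsmoothed, smoothed)
```

With `fL=0` the unseen callee gets probability zero and would veto that caller outright; with `fL=1` it gets a small positive probability instead, while the estimates over all values of the feature still sum to one.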


We observe that the outputs of the two models, fit and fit1 (the latter with Laplace and
usekernel), are identical in this case.
Thus, with 73% accuracy, we apply our model to the unseen data (audit).

To obtain the posterior probabilities for each observation in the unseen data, we type the
following command:


Now we build the Decision Tree model:

> plot (ctreeFit$finalModel)


Reading the graph: if the evening call is made to Frank, the night call to Clark, and the
morning call to Kelly, Larry or Robert, the probability that the caller is Vince is close
to 80%.
A disadvantage of the Decision Tree plot is that the name Sally is shown below each of the
nodes, irrespective of the correct name.
We also find that the accuracy of Naïve Bayes (0.661) is better than that of the Decision Tree
(0.578). We therefore now compare the Naïve Bayes output obtained from RapidMiner against the
output of R.

Using RapidMiner
Initially both data sets are uploaded into RapidMiner. The Black-List Subscriber Call-Logs data
set is named telecom234 and the Audit Log is named telecom56. Many different classification
algorithms are available in RapidMiner; of these we choose the Naïve Bayes classifier. The
telecom234 data set is dragged into the main window. In data mining we use the concept of data
splitting: we divide our data set into two parts, a training set and a validation set. The
training set is used to create the model, whereas the validation set is used to estimate the
accuracy of the created model. To create a model and estimate its accuracy using the data
splitting technique, we use the validation operators found in the Evaluation -> Validation
folder of the operator window. The most commonly used operators are Split Validation and
X-Validation. We first make use of Split Validation: drag and drop the Split Validation operator
into the process window. Split Validation is a group operator, i.e., it groups multiple
operators inside it. Group operators carry a special sign, two overlapping blue squares on
their icon, as shown in the figure below.


The Validation operator has a split ratio parameter (visible in the parameter window on the
right), which specifies how the data set will be split. The value 0.7 in the figure above splits
the data set into 70% of the data for the training set and the remaining 30% for the testing
set. Split Validation is a nested operator: double-clicking it opens the Validation sub-process
window, which has two parts, Training and Testing.
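The effect of a 0.7 split ratio can be sketched outside RapidMiner as well. The Python snippet below is an illustration only: the helper `split_rows`, the shuffling before the cut, and the seed are assumptions, and RapidMiner's own sampling options may behave differently.

```python
import random

def split_rows(rows, split_ratio=0.7, seed=1):
    """Shuffle a copy of the rows, then cut it at split_ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]

training, testing = split_rows(range(138))  # 138 rows, as in the call logs
print(len(training), len(testing))
```

The first part would feed the Training side of the sub-process and the second part the Testing side.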
Now the Naïve Bayes operator is placed in the Training window. Validation allows us to
estimate the accuracy of our model. For this purpose, RapidMiner provides many performance
operators in the Performance Measurement folder. Place the Apply Model and Performance
(Classification) operators in the Testing window, as shown in the figure below.


Now the averagable port of the Validation operator is connected to the result port of the
process window. The Results perspective then shows a performance vector with details about the
performance of the created model. For example, the created model has an accuracy of almost
71%, as shown in the figure below, which is comparable to the 73.3% obtained with R.

Once the model is created, it can be used to perform classification/prediction. The telecom56
data set, which is unlabeled, is dragged into the main window, along with an Apply Model
operator. The Apply Model operator takes the model from the Validation operator and applies it
to the unlabeled telecom56 data, as shown in the figure below.

Running the whole process now produces the prediction shown in the figure below: the names of
the probable callers along with the confidence in terms of probability.

Conclusion
Comparing the accuracy and precision from the confusion matrices of the RapidMiner and R
results, we see:
For R:
• The accuracy of our model is (6+3+2)/15 = 73.3%
• Precision of predicting Sally = 6/9 = 66.67%
• Precision of predicting Vince = 3/3 = 100%
• Precision of predicting Virginia = 2/3 = 66.67%

For RapidMiner:
• The accuracy of our model is 70.73%
• Precision of predicting Sally = 81.82%
• Precision of predicting Vince = 58.33%
• Precision of predicting Virginia = 72.22%

Since both statistical packages give an accuracy above 70%, we can be reasonably confident
about our model and come to the conclusion that Naïve Bayes may be considered the best
classifier in this case, where the training data set is small and categorical.

Clustering
 
Decision tree
Decision treeDecision tree
Decision tree
 
Market Potential of HCL
Market Potential of HCL Market Potential of HCL
Market Potential of HCL
 
Consumer Behavior Analysis: A study of Cafe Coffee Day
Consumer Behavior Analysis: A study of Cafe Coffee DayConsumer Behavior Analysis: A study of Cafe Coffee Day
Consumer Behavior Analysis: A study of Cafe Coffee Day
 
Introduction to Pinterest
Introduction to PinterestIntroduction to Pinterest
Introduction to Pinterest
 

Recently uploaded

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 

Recently uploaded (20)

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Subscription fraud analytics using classification

  • 1. SUBSCRIPTION FRAUD ANALYTICS USING CLASSIFICATION SOMDEEP KUMAR SEN Trimax Analytics and Optimization Services 2/21/2014
  • 2. Subscription Fraud Analytics Using Naïve Bayes Classifier
    Contents
    Introduction (3)
    Overview of the study (3)
    Objective of the Study (4)
    Telecommunication fraud: an Overview (4)
      Definition (4)
      Types (4)
        Subscription fraud (4)
        Recharge Voucher Fraud (4)
        Pre-paid Balance Fraud (4)
        Unauthorized Service Fraud (4)
    Models Used (5)
      Naïve Bayes Classification: an overview (5)
      Decision Tree (A Supervised Learning Method) (5)
    Methodology (6)
    Analysis & Findings (6)
      Using R (6)
      Using RapidMiner (13)
    Conclusion (18)
  • 3. Introduction
    Technological tools such as computers, the internet, and cellular phones have made life easier and more convenient for most people. However, some individuals and groups have turned these telecommunication devices into tools for defrauding unsuspecting victims. It is not uncommon for a scam to originate in a city, state, or even country different from the one in which the victim resides. While telecom fraud takes many forms, the present study focuses on using analytics to detect subscription fraud, specifically on applying the Naïve Bayes classification algorithm to detect and predict probable fraudsters.

    Overview of the study
    A fictitious telecom company called Bad Idea introduced an unusual rate plan, the Praxis Plan, under which callers may make only one call in each of four time slots: Morning (9AM-Noon), Afternoon (Noon-4PM), Evening (4PM-9PM) and Night (9PM-Midnight), i.e. four calls per day. Despite the plan's popularity, Bad Idea became the target of subscription fraud by a gang of three fraudsters: Sally, Virginia and Vince. Their services were eventually terminated. Bad Idea holds their call logs, which span one and a half months.
    The company's analytics team has been given two data sets: Black-List Subscriber Call-Logs and Audit Log. The Black-List Subscriber Call-Logs data set contains the calling patterns of the three fraudsters, Sally, Virginia and Vince. Every 5 days the company runs an audit to check whether these fraudsters have rejoined its network: it reviews the list of subscribers who have called the same people as the three fraudsters in the same time frames. That list is provided in the Audit Log.
    Test Data: http://bit.ly/1du9cRs
    Training Data: http://bit.ly/1du9AQ1
  • 4. Objective of the Study
    - To provide the names of the probable callers and the confidence, in terms of probability
    - To provide the name of the fraudster, if any
    - To provide the code used to determine the subscriber

    Telecommunication fraud: an Overview
    Definition
    Telecommunication fraud is the theft of telecommunication services (telephones, cell phones, computers, etc.) or the use of telecommunication services to commit other forms of fraud. Victims include consumers, businesses and communication service providers.

    Types
    Subscription fraud: Subscription fraud occurs when someone signs up for service with fraudulently obtained customer information or false identification. Lawbreakers obtain a subscriber's personal information and use it to set up a cell phone account in that subscriber's name.
    Recharge Voucher Fraud: This mainly includes unusual top-up recharges and a high number of recharges in a given time period.
    Pre-paid Balance Fraud: Employees with a high number of manual balance changes, as well as subscribers with high balances, may indicate pre-paid balance fraud.
    Unauthorized Service Fraud: A mismatch found when reconciling HLR (Home Location Register) records against post-paid subscriber profiles or services, or a sudden change in subscriber usage, could indicate unauthorized service fraud.
  • 5. Models Used
    Naïve Bayes Classification: an overview
    A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". In simple terms, a naive Bayes classifier assumes that, given the class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
    Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods. An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix.

    Decision Tree (A Supervised Learning Method)
    A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). A path from root to leaf represents a classification rule. In decision analysis, a decision tree and the closely related influence diagram are used as visual and analytical decision support tools, where the expected values (or expected utility) of competing alternatives are calculated.
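As a compact restatement of the naive Bayes overview above, the independence assumption can be written as follows. The notation is introduced here and does not appear in the slides: $C$ is the class (in this study, one of Sally, Vince or Virginia) and $x_1, \dots, x_n$ are the observed features (here, the person called in each time slot).

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{C} \;=\; \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The predicted class $\hat{C}$ is the one with the largest product of the prior and the per-feature likelihoods; dividing each product by their sum over all classes gives the posterior probabilities reported as "confidence".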
  • 6. Methodology
    To make the final prediction, Naïve Bayes classification was carried out in two different packages, R and RapidMiner, so that the results produced by the two could be compared.

    Analysis & Findings
    Using R
    Our training data (BlackListSubscriberCallLogs), an Excel sheet, has 138 instances of the names of the people called by the fraudsters Sally, Vince and Virginia in each of the time frames. We import this dataset into R as "blacklisted". We also have a file (Audit Log) of 15 instances, for which we predict the fraudster by the end of this report. This is our unseen data. We import this dataset into R as "audit".
    The Process:
    - Import the datasets and understand them
    - Install packages and load the libraries "caret" and "klaR" for Naïve Bayes and "party" for Decision Tree
    - Train our model (Naïve Bayes) using 10-fold cross-validation
    - Tweak the parameters of the model to obtain finer results
    - Check the Accuracy and Kappa values
    - Compare the result of the Naïve Bayes model with a 10-fold cross-validated Decision Tree model
  • 7. The method above (R code shown on the original slide) is used for 10-fold cross-validation, which divides the entire dataset into ten parts and uses nine parts for training and one part for testing. This is repeated 10 times, so that each part is held out for testing once. The reported outcome of the model is obtained after all ten trials have been run.
    Now we shuffle our dataset (blacklisted) for random sampling and take 15 observations (about 10%) on which to apply our model and check its accuracy. Calling set.seed() beforehand makes this random sample reproducible, so the same set of observations can be drawn again.
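The 10-fold procedure described above can be sketched in a few lines. This is an illustration in Python rather than the R/caret code used in the study; the function names and the seed value are assumptions made for the example, and only the row count (138) comes from the report.

```python
import random

def k_fold_indices(n_rows, k=10, seed=42):
    """Split row indices 0..n_rows-1 into k shuffled, near-equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # fixed seed -> reproducible folds
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_rows, k=10, seed=42):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_rows, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

splits = list(train_test_splits(138))      # 138 rows, as in the training data
fold_sizes = [len(test) for _, test in splits]
```

With 138 rows, eight folds hold 14 rows and two hold 13, so each round trains on roughly nine tenths of the data and tests on the remaining tenth.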
  • 8. Upon analyzing the confusion matrix, we find that:
    - The accuracy of our model is (6+3+2)/15 = 73.3%
    - Precision of predicting Sally = 6/9 = 66.67%
    - Precision of predicting Vince = 3/3 = 100%
    - Precision of predicting Virginia = 2/3 = 66.67%
    Now we tweak our model using the Laplace (fL) and usekernel parameters. Laplace is a smoothing technique that assigns a non-zero probability to events that do not occur in the sample. usekernel switches on kernel density estimation, a non-parametric way to estimate the probability density function of a random variable.
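The arithmetic behind these figures can be checked with a short sketch. It is written in Python rather than the R used in the study, and it uses only the counts quoted above, because the full confusion matrix is not reproduced in the transcript; the variable names are chosen for the example.

```python
# Counts quoted in the text: correct predictions per class (the diagonal of the
# confusion matrix) and how many times each class was predicted in total.
correct = {"Sally": 6, "Vince": 3, "Virginia": 2}
predicted = {"Sally": 9, "Vince": 3, "Virginia": 3}
total = 15  # size of the held-out sample

# Accuracy: share of all observations that were classified correctly.
accuracy = sum(correct.values()) / total

# Precision per class: correct predictions of a class / all predictions of it.
precision = {name: correct[name] / predicted[name] for name in correct}

print(f"accuracy = {accuracy:.1%}")
for name, value in precision.items():
    print(f"precision({name}) = {value:.2%}")
```

Recall per class cannot be derived from these numbers alone, since it would need the actual number of observations for each fraudster in the sample.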
  • 9. [Figure: slide content not captured in the transcript]
  • 10. We observe that the outcomes of the two models, fit and fit1 (the latter with Laplace and usekernel), are identical in this case. Thus, with 73% accuracy, we apply our model to the unseen data (audit). The posterior probabilities for each observation in the unseen data are then obtained with a prediction command (shown on the original slide, not captured in this transcript).
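The R command itself is not captured in this transcript. As a separate illustration of what "posterior probabilities" and Laplace smoothing mean for categorical data like these call logs, here is a minimal sketch in Python (not the klaR implementation used in the study). The function names, the toy rows and several callee names are hypothetical and are not taken from the actual data sets.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels, laplace=1.0):
    """Fit a categorical naive Bayes model with additive (Laplace) smoothing."""
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # value_counts[(feature index, class)][value] = how often value was seen
    value_counts = defaultdict(Counter)
    vocab = [set() for _ in range(n_features)]
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            vocab[j].add(value)
    return class_counts, value_counts, vocab, laplace

def posterior(model, row):
    """Return P(class | row) for every class, normalised to sum to 1."""
    class_counts, value_counts, vocab, laplace = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                      # prior P(class)
        for j, value in enumerate(row):
            seen = value_counts[(j, label)][value]
            # Smoothed likelihood: unseen values keep a small non-zero probability.
            score *= (seen + laplace) / (count + laplace * len(vocab[j]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical toy call log: (morning, afternoon, evening, night) callee per day.
rows = [
    ("Kelly", "Amy", "Frank", "Clark"),
    ("Larry", "Amy", "Frank", "Clark"),
    ("Robert", "Bob", "Gina", "Clark"),
    ("Kelly", "Bob", "Gina", "Dave"),
]
labels = ["Vince", "Vince", "Sally", "Virginia"]

model = train_nb(rows, labels)
probs = posterior(model, ("Kelly", "Amy", "Frank", "Dave"))
best = max(probs, key=probs.get)   # most probable caller for this day
```

Without smoothing (laplace=0), a single callee never seen for a class would force that class's probability to zero, which is the situation Laplace smoothing is meant to avoid.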
  • 11. Now, we build the Decision Tree model:
    > plot(ctreeFit$finalModel)
  • 12. Reading the graph: if the evening call is made to Frank, the night call to Clark, and the morning call to Kelly, Larry or Robert, the probability that the fraudster is Vince is close to 80%. A disadvantage of the Decision Tree plot is that the name Sally appears below each of the nodes, irrespective of the correct name. We also find that the accuracy of Naïve Bayes (0.661) is better than that of the Decision Tree (0.578). We therefore now compare the Naïve Bayes output obtained with RapidMiner against the output of R.
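The single root-to-leaf path read off the plot above amounts to one classification rule. The sketch below writes that rule out in Python (not the R "party" model itself); the function name is made up for the example, and the 0.8 value is the approximate figure quoted in the text, not an exact leaf probability.

```python
def tree_path_says_vince(morning, evening, night):
    """Encode the one root-to-leaf path described in the text as a rule."""
    return (
        evening == "Frank"
        and night == "Clark"
        and morning in {"Kelly", "Larry", "Robert"}
    )

# Approximate probability of Vince at that leaf, as read from the plot ("close to 80%").
P_VINCE_ON_PATH = 0.8

matches = tree_path_says_vince(morning="Larry", evening="Frank", night="Clark")
no_match = tree_path_says_vince(morning="Larry", evening="Frank", night="Kelly")
```

A full tree is a set of such rules, one per leaf, each with its own class distribution.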
  • 13. Using RapidMiner
    Both data sets are first uploaded into RapidMiner. The Black-List Subscriber Call-Logs data set is named telecom234 and the Audit Log is named telecom56. Of the many classification algorithms available in RapidMiner, we choose the Naïve Bayes classifier. The telecom234 data set is dragged into the main window.
    Data mining commonly relies on data splitting: the data set is divided into two parts, a training set and a validation set. The training set is used to create the model, whereas the validation set is used to estimate the accuracy of the created model. To create a model and estimate its accuracy with data splitting, we use the validation operators found in the Evaluation -> Validation folder of the operator window. The most commonly used operators are Split Validation and XValidation.
    We first use Split Validation: drag and drop the Split Validation operator into the process window. Split Validation is a group operator, i.e. it groups multiple operators inside it. Group operators carry a special sign, two overlapping blue squares on their icon, as shown in the figure on the slide.
  • 14. The Validation operator has a split ratio parameter (visible in the parameter window on the right), which specifies how the data set will be split. The value 0.7 shown in the figure puts 70% of the data in the training set and the remaining 30% in the testing set. Split Validation is a nested operator: double-clicking it opens its sub-process window, which has two parts, Training and Testing. The Naïve Bayes operator is placed in the Training window. Validation allows us to estimate the accuracy of our model; for this purpose, RapidMiner provides many performance operators in the Performance Measurement folder. The Apply Model and Performance (Classification) operators are placed in the Testing window, as shown in the figure on the slide.
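The effect of the 0.7 split ratio described above can be illustrated with a small sketch. It is written in Python and is not RapidMiner's implementation: the function name, the seed and the rounding rule are assumptions made for the example, and only the row count (138) and the ratio come from the report.

```python
import random

def split_validation(n_rows, split_ratio=0.7, seed=1):
    """Shuffle row indices and split them into training and testing parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # random sampling, reproducible
    n_train = round(n_rows * split_ratio)      # assumed rounding rule
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_validation(138, split_ratio=0.7)
```

Under these assumptions the 138 rows divide into 97 for training and 41 for testing. A 41-row test set would be consistent with the 70.73% accuracy reported for RapidMiner (29/41 ≈ 0.7073), although the transcript does not state the test-set size.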
  • 15. Next, the averagable ports of the Validation operator are connected to the result port of the Process window. The result perspective then shows a performance vector with details of the created model's performance. For example, the created model has an accuracy of almost 71%, as shown in the figure on the slide, which is quite good.
  • 16. Once the model is created, it is used to perform classification/prediction. The telecom56 data set, which is the unlabeled data set, is dragged into the main window together with an Apply Model operator. The Apply Model operator receives the model from the Validation operator and applies it to the unlabeled telecom56 data, as shown in the figure on the slide.
  • 17. Running the whole process now produces the prediction shown in the figure on the slide: the names of the probable callers along with the confidence, in terms of probability.
  • 18. Conclusion
    Comparing the accuracy and precision from the confusion matrices of the R and RapidMiner results, we see:
    For R:
    - Accuracy of the model = (6+3+2)/15 = 73.3%
    - Precision of predicting Sally = 6/9 = 66.67%
    - Precision of predicting Vince = 3/3 = 100%
    - Precision of predicting Virginia = 2/3 = 66.67%
    For RapidMiner:
    - Accuracy of the model = 70.73%
    - Precision of predicting Sally = 81.82%
    - Precision of predicting Vince = 58.33%
    - Precision of predicting Virginia = 72.22%
    Since both packages give an accuracy above 70%, we can be confident about our model and conclude that Naïve Bayes may be considered the best classifier in this case, where the training data is considerably small and categorical.