Digit Recognizer (Machine Learning)

           Internship Project Report




               Submitted to:

          Persistent System Limited

Product Engineering Unit-1 for Data Mining, Pune




                 Submitted By:-

                  Amit Kumar

        PGPBA Praxis Business School, Kolkata

          Project Mentor:-Mr. Yogesh Badhe

       (Technical Specialist, Product Engineering Unit-1)

          Start Date of Internship: 16th July 2012

          End date of Internship: 15th October 2012

                Report Date: 15th October 2012
Preface


This report documents the work done during the summer internship at Persistent System Limited,
Pune on classifying handwritten digits, under the supervision of Mr. Yogesh Badhe. The report
gives an overview of the tasks completed during the internship, with technical details. The
results obtained are then discussed and analyzed. I have tried my best to keep the report simple yet
technically correct. I hope I have succeeded in my attempt.


Amit Kumar
ACKNOWLEDGEMENT




Simply put, I could not have done this work without the generous help I received from the Data Mining
Team. The work culture at Persistent System Limited is really motivating. Everybody here is such a
friendly and cheerful companion that work stress never gets in the way. I would especially like to thank
Mr. Mukund Deshpande, who gave me this project to learn and understand the business implications of
statistical algorithms. I am equally thankful to Mr. Yogesh Badhe, who helped me from understanding
the project through to building the statistical model. He not only advised me on the project,
but also listened to my arguments in our discussions. I am also very thankful to Ms. Deepti Takale, who
helped me a lot in absorbing the statistical concepts.

Amit Kumar
Abstract

The report presents the three tasks completed during the summer internship at Persistent System
Limited, which are listed below:
1. Understanding the problem objective & business implications
2. Understanding the data & building the model
3. Evaluating the model
All these tasks were completed successfully, and the results were according to expectations. Each
task needed a very systematic approach, starting from the behavior of the data, through the application
of the algorithm, to the evaluation of the model. The most challenging part was the domain knowledge
needed to understand the behavior of the data. Once the data had been prepared, we applied a statistical
algorithm for model building. This is one of the major areas of the work and really needs a fundamental,
conceptual knowledge of advanced statistics.

Amit Kumar
Introduction:- This project is taken from Kaggle, a platform for predictive modeling and analytics
competitions. Organizations and researchers post their data there, and statisticians and data scientists from
all over the world compete to produce the best models.

Problem Statement:- Each image of a handwritten digit is 28 pixels in height and 28 pixels
in width, for a total of 784 pixels. Each pixel has a single pixel value associated with it, indicating the lightness
or darkness of that pixel, with higher meaning darker. The pixel value is an integer between 0 and 255,
inclusive. Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and
783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i
and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28
matrix (indexing from zero).
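The decomposition above can be checked with a short sketch (Python is used here purely for illustration; the function name is my own):

```python
def pixel_position(x):
    """Recover the zero-indexed (row, column) of pixelx in the
    28 x 28 grid, inverting x = i * 28 + j."""
    i, j = divmod(x, 28)  # integer division gives the row, remainder the column
    return i, j

# pixel0 is the top-left corner, pixel783 the bottom-right corner,
# and pixel28 starts the second row.
assert pixel_position(0) == (0, 0)
assert pixel_position(783) == (27, 27)
assert pixel_position(28) == (1, 0)
```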

Goal of the Competition:- Take an image of a single handwritten digit and determine what that digit is.

Approach for the model building:-

Develop the Analysis Plan:- To establish the conceptual model, we need to understand the selected
techniques and the model implementation issues. Here we will build a predictive model based
on the random forest algorithm. This choice of model is based upon the sample size and the required type of
variable (metric versus nonmetric).

Evaluation of Underlying Techniques:- All multivariate techniques rely on underlying assumptions, both
statistical and conceptual, that substantially affect their ability to represent multivariate relationships. For
techniques based on statistical inference, the assumptions of multivariate normality, linearity,
independence of the error terms, and equality of variance in a dependence relationship must all be met. Since
our data is categorical, we do not need to verify linearity or any independence relation.

Estimate the Model and Assess Overall Model Fit:- With the assumptions satisfied, the analysis proceeds to
the actual estimation of the model and an assessment of overall model fit. In the estimation process we can
choose options to meet specific characteristics of the data or to maximize the fit to the data. After the model is
estimated, the overall model fit is evaluated to ascertain whether it achieves acceptable levels on statistical
criteria. Many times the model is respecified in an attempt to achieve a better level of overall fit or explanation.

Interpret the Variate:- With an acceptable level of model fit, interpreting the variate reveals the nature of the
relationship. The interpretation of effects for individual variables is made by examining the estimated weight
for each variable in the variate.

Validate the Model:- Before accepting the results, we must subject them to one final set of diagnostic
analyses that assess the degree of generalizability of the results using the available validation methods. The
attempt to validate the model is directed towards demonstrating that the results generalize to the total
population. These diagnostic analyses add little to the interpretation of the results, but can be viewed as
insurance that the results are truly descriptive of the data.
Required statistical concepts for this project:-

Data Mining:- Also called Knowledge Discovery in Databases, data mining is a field at the intersection
of computer science and statistics that attempts to discover patterns in large data sets. It utilizes
methods from artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and transform it
into an understandable structure for future use.

Decision Tree:- A decision tree can be used to predict a pattern or to classify the class of a data point. It is
commonly used in data mining. The goal of the decision tree algorithm is to create a model that
predicts the target variable based upon several input variables. In a decision tree, each leaf represents a
value of the target variable given the values of the input variables represented by the path from the root to
the leaf. A tree can be learned by splitting the source set into subsets based on an attribute value test.
This process, repeated on each derived subset, is called recursive partitioning. The general fashion for
learning a tree is top-down induction.

Decision trees used in data mining are of two main types:-

    1. Classification Tree:- When the predicted outcome is the class to which the data belongs.
    2. Regression Tree:- When the predicted outcome can be considered a real number.

The term classification & regression tree (CART) analysis is an umbrella term used to refer to both of the
above procedures.

Some other techniques construct more than one decision tree, e.g. bagging, random forests, and boosted
trees. We have used the random forest algorithm for this project. The algorithms used for
constructing decision trees usually work top-down by choosing, at each step, the variable that best
splits the set of items. “Best” is defined by how well the variable splits the set
into homogeneous subsets that share the same value of the target variable. Different algorithms use
different formulae for measuring “best”.

Impurity is measured with mathematical functions such as Gini impurity, information entropy, and misclassification error.
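As an illustration of two common impurity measures, Gini impurity and entropy can be computed from a set of class labels as follows (a Python sketch; the helper names are my own):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions.
    0 for a pure set; larger values mean more mixed classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; 0 for a pure set."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A pure node has zero impurity; a 50/50 split is maximally impure.
assert gini([1, 1, 1, 1]) == 0.0
assert abs(gini([0, 1, 0, 1]) - 0.5) < 1e-12
assert abs(entropy([0, 1, 0, 1]) - 1.0) < 1e-12
```

A split is scored by the weighted impurity of the child subsets it creates; the “best” split is the one that reduces impurity the most.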
Random Forest:- Recently there has been a lot of interest in “ensemble learning” methods that generate many
classifiers and aggregate their results. Two well-known methods are boosting and bagging of classification trees.
In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors; in the
end, a weighted vote is taken for the prediction. In bagging, successive trees do not depend on
earlier trees; each is independently constructed using a bootstrap sample of the data set. In the
end, a simple majority vote is taken for the prediction. Breiman (2001) proposed random forests,
which add an additional layer of randomness to bagging. In addition to constructing each tree
using a different bootstrap sample of the data, random forests change how the classification or
regression trees are constructed. In standard trees, each node is split using the best split among
all variables. In a random forest, each node is split using the best among a subset of predictors
randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform
very well compared to many other classifiers, including discriminant analysis, support vector
machines, and neural networks, and is robust against overfitting. In addition, it is very user-friendly
in the sense that it has only two parameters (the number of variables in the random
subset at each node and the number of trees in the forest), and it is usually not very sensitive to
their values.

The Algorithm:-
The random forests algorithm (for both classification and regression) is as follows:
1. Draw ntree bootstrap samples from the original data.
2. For each of the bootstrap samples, grow an unpruned classification or regression tree, with
the following modification: at each node, rather than choosing the best split among all
predictors, randomly sample mtry of the predictors and choose the best split from among those
variables. (Bagging can be thought of as the special case of random forests obtained when mtry
= p, the number of predictors.)
3. Predict new data by aggregating the predictions of the ntree trees (i.e., majority vote for
classification, average for regression). An estimate of the error rate can be obtained, based on
the training data, as follows:
1. At each bootstrap iteration, predict the data not in the bootstrap sample (what Breiman
calls “out-of-bag”, or OOB, data) using the tree grown with the bootstrap sample.
2. Aggregate the OOB predictions. (On average, each data point would be out-of-bag
around 36% of the time, so aggregate these predictions.) Calculate the error rate, and call it
the OOB estimate of the error rate.
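The ~36% out-of-bag figure follows from sampling with replacement: a given row is missed by one bootstrap sample of size n with probability (1 - 1/n)^n, which converges to 1/e ≈ 0.368. A quick numerical check (Python, for illustration; n = 42000 is the training-set size used later in this report):

```python
import math

n = 42000                      # rows in the training set
p_oob = (1 - 1 / n) ** n       # chance a given row is absent from one bootstrap sample
# p_oob is close to 1/e ~= 0.368, i.e. roughly 36% of rows are out-of-bag per tree
assert abs(p_oob - math.exp(-1)) < 1e-4
```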
Source Code:- # makes the random forest submission

# load the randomForest package (install.packages("randomForest") if missing)
library(randomForest)

# read the Kaggle training and test data
train <- read.csv("../data/train.csv", header=TRUE)
test <- read.csv("../data/test.csv", header=TRUE)

# the first column of train.csv is the digit label; treat it as a factor
labels <- as.factor(train[,1])
train <- train[,-1]

# fit a 1000-tree random forest and predict the test set in one call
rf <- randomForest(train, labels, xtest=test, ntree=1000)

# map the predicted factor levels back to digit strings, one per line
predictions <- levels(labels)[rf$test$predicted]
write(predictions, file="rf_benchmark.csv", ncolumns=1)
Solutions:- For this training data set we take five random samples (20 percent each) and build five
different models. Since our data set is large, we need to combine the respective model results to validate the
overall model accuracy. Our approach could vary; for example, we could build a model on 80 percent of the
training data and keep 20 percent of the data to validate the model.
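One reading of this sampling scheme, consistent with each confusion matrix below totalling 8,400 rows and the combined matrix totalling 42,000, is a partition into five disjoint 20% subsets. A sketch (Python for illustration; the original work used R, and the function name and seed are my own):

```python
import random

def five_subsets(n_rows, seed=0):
    """Split row indices into five disjoint random 20% subsets,
    one subset per model."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    size = n_rows // 5
    return [idx[k * size:(k + 1) * size] for k in range(5)]

subsets = five_subsets(42000)
assert len(subsets) == 5
assert all(len(s) == 8400 for s in subsets)   # 20% of 42000 each
assert len(set().union(*subsets)) == 42000    # disjoint and covering
```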

Model-1 (Random Forest algorithm, sample=train.csv)
OOB estimate of error rate: 5.32%
                                     Confusion Matrix
      0     1     2     3       4      5       6       7        8      9 class.error
0   848     0     0     1       3      0       6       0        8      0             0.0207852
1     0   933     8     3       3      1       2       1        0      1              0.019958
2    10     2   765     6       4      0       5       7        9      2             0.0555556
3     2     3    20   812       3     18       2       9       17      4             0.0876405
4     2     1     0     0     787      0       6       3        0     23             0.0425791
5     7     2     1    20       0    687      13       0        4      7             0.0728745
6     9     2     1     0       5     12    762        0        2      0             0.0390921
7     1     6    12     1       6      2       1     870        5     14             0.0522876
8     2     6     8    16       4      8       2       3     744      11             0.0746269
9     5     4     1    11      16      3       2       7       10    745             0.0733831
                                                                    8400
Model-2 (Random Forest algorithm, sample=train1.csv)
OOB estimate of error rate: 5.14%
                                  Confusion matrix
      0     1     2     3      4      5       6        7      8      9 class.error
0   773     0     1     0      2      0       7        0      5      0          0.0190355
1     0   946     3     5      2      1       1        2      3      0          0.0176532
2     3     1   766     4     11      1       6       11      6      1           0.054321
3     2     4    17   767      1     14       2        2     16      8          0.0792317
4     5     1     1     0    774      0       8        0      2     20          0.0456227
5     7     5     1    19      3    732       3        1      5      9          0.0675159
6     7     2     0     1      2      6    821         0      2      0          0.0237812
7     1     3    12     2      7      0       0     879       5     20          0.0538213
8     2     9     4     9      4     14       6        0    716     15          0.0808729
9     7     3     2    14     18      1       1       14      7    794          0.0778165
                                                                  8400
Model-3 (Random Forest algorithm, sample=train2.csv)
OOB estimate of error rate: 5.63%
                                  Confusion matrix
      0     1     2     3     4       5       6        7     8      9 class.error
0   864     0     2     0     1       1       4        0     7      1          0.0181818
1     0   896     6     2     4       2       1        3     4      2           0.026087
2     6     4   810     6     8       2       8      11      5      0          0.0581395
3     5     3    18   788     1      14       3      11     18      5          0.0900693
4     3     3     3     1   714       2       4        2     5     21          0.0580475
5     7     3     0    18     2     709       9        1     6      4          0.0658762
6     7     2     0     0     4       9    790         0     3      0          0.0306749
7     1     7    11     0     6       0       0     838      3     24           0.058427
8     4    11     5    18     4       9       3        3   745     13          0.0858896
9     8     5     2    14    14       2       0      10      9    773          0.0764636
                                                                 8400
Model-4 (Random Forest algorithm used)
OOB estimate of error rate: 5.35%
                                    Confusion matrix
      0     1     2     3      4        5       6        7      8      9 class.error
0   778     0     0     0      0        4       3        0      4      0           0.0139417
1     0   956     6     1      2        1       2        2      0      1            0.015448
2     7     5   759    10      8        0       7      12      10      3           0.0755177
3     1     3    12   812      1      24        1        7      9      4           0.0709382
4     1     3     0     0    788        0       6        3      2     22           0.0448485
5     6     4     1    19      3     710        8        1      7      6           0.0718954
6     7     2     1     0      2        5     800        0      2      0            0.023199
7     1     4     9     2      7        0       0     830       0     26           0.0557452
8     2     8     3    14      2      10        6        2    746     15           0.0767327
9     5     0     7    15     24        2       2      11      11    772           0.0906949
                                                                    8400
Model-5 (Random Forest algorithm used)
OOB estimate of error rate: 5.30%
                                    Confusion matrix
      0     1     2     3       4       5       6        7      8      9 class.error
0   782     0     0     1       1       1       4        1      5      1          0.0175879
1     0   927     5     5       1       2       3        1      1      1          0.0200846
2     6     1   802     8       6       0       5      14       5      3          0.0564706
3     1     2     9   815       2     20        2        9     11      7           0.071754
4     1     3     1     0     776       0      12        2      5     21          0.0548112
5     6     4     2    12       4    696       10        0      5      6          0.0657718
6     6     1     5     0       3       8    796         0      2      0          0.0304507
7     1     9    14     1       7       1       0     825       4      8          0.0517241
8     0     8    11    15       8     11        7        0    727     17          0.0957711
9     5     2     0    12      12       4       1      11      13    809          0.0690449
                                                                    8400
Model Validation:-
Overall Model Accuracy:- OOB estimated error: 3.32%
                          Confusion matrix (rows = actual, columns = predicted)
Actual           0      1      2      3      4        5      6        7     8       9 Class error
    0        4074       0      2      1      3        4     18        1    16       0 0.010898523
    1            0   4686     20      7      8        4     10        6     7       4 0.013888889
    2           14     12   3992     19     26        7     18       32    25       6 0.038304023
    3            3      3     44   4144      2      62       7       17    43      16 0.045381249
    4            4      8      4      1   3916        0     16        6    10      72 0.029972752
    5           21     14      2     30      5    3656      29        3    20      15 0.036627141
    6           24      5      3      0      9      25 4017           0     6       0 0.017608217
    7            7     18     37      7     20        1      0 4327         8      61 0.035443602
    8            5     30     18     45     20      28      18        3 3811       32 0.049625935
    9           20      5      9     47     51        6      2       31    20    4029 0.045260664
                                                                                42000
Interpretation of the Model:-
The prediction of the model on the test data set is taken as the collective result of all five models, with a
weighted average taken to determine the best-fitting result. Based upon the confusion matrices and the
recursive partitioning of the data set used to build the models, the average confidence is .954. Here we
have avoided overfitting because in the random forest algorithm the data taken to build the model is
random and unbiased. Interestingly, random forests can handle missing values and do not
require pruning. In a random forest, roughly 36% of the samples are not selected in each
bootstrap; these form what we call the out-of-bag (OOB) sample. Using the OOB sample as input to the
corresponding tree, predictions are made.
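The combination step described above can be sketched as a vote across the five models' per-row predictions (the report mentions a weighted combination; the simplest unweighted majority vote is shown here as an illustrative Python sketch with hypothetical names):

```python
from collections import Counter

def combine_predictions(per_model_preds):
    """Majority vote across models for each test row.
    per_model_preds: equal-length prediction lists, one list per model."""
    combined = []
    for votes in zip(*per_model_preds):           # one tuple of votes per row
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three toy models voting on four test rows:
preds = [["3", "7", "1", "9"],
         ["3", "7", "2", "9"],
         ["3", "1", "1", "9"]]
assert combine_predictions(preds) == ["3", "7", "1", "9"]
```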



Bibliography:-

Multivariate Data Analysis, by Hair, Black & Tatham

http://www.webchem.science.ru.nl:8080/PRiNS/rF.pdf

http://people.revoledu.com/kardi/tutorial/DecisionTree/index.html

http://www.statmethods.net/interface/workspace.html

Weitere ähnliche Inhalte

Was ist angesagt?

project sentiment analysis
project sentiment analysisproject sentiment analysis
project sentiment analysis
sneha penmetsa
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Lior Rokach
 

Was ist angesagt? (20)

project sentiment analysis
project sentiment analysisproject sentiment analysis
project sentiment analysis
 
Book Selling Website Report
Book Selling Website ReportBook Selling Website Report
Book Selling Website Report
 
Billing and Invoice Management System
Billing and Invoice Management SystemBilling and Invoice Management System
Billing and Invoice Management System
 
Supervised learning and Unsupervised learning
Supervised learning and Unsupervised learning Supervised learning and Unsupervised learning
Supervised learning and Unsupervised learning
 
ELECTRICITY BILLING SYSTEM (1).pptx
ELECTRICITY BILLING SYSTEM (1).pptxELECTRICITY BILLING SYSTEM (1).pptx
ELECTRICITY BILLING SYSTEM (1).pptx
 
Smart India Hackathon Idea Submission
Smart India Hackathon Idea SubmissionSmart India Hackathon Idea Submission
Smart India Hackathon Idea Submission
 
Internship report on AI , ML & IIOT and project responses full docs
Internship report on AI , ML & IIOT and project responses full docsInternship report on AI , ML & IIOT and project responses full docs
Internship report on AI , ML & IIOT and project responses full docs
 
Ppt on data science
Ppt on data science Ppt on data science
Ppt on data science
 
E healthcare
E healthcare E healthcare
E healthcare
 
IoT in Healthcare
IoT in HealthcareIoT in Healthcare
IoT in Healthcare
 
online marketplace report
online marketplace reportonline marketplace report
online marketplace report
 
Mini Project PPT
Mini Project PPTMini Project PPT
Mini Project PPT
 
Harsh Mathur Restaurant Billing System project ppt
Harsh Mathur Restaurant Billing System project pptHarsh Mathur Restaurant Billing System project ppt
Harsh Mathur Restaurant Billing System project ppt
 
Applications of Machine Learning
Applications of Machine LearningApplications of Machine Learning
Applications of Machine Learning
 
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Tree pruning
 Tree pruning Tree pruning
Tree pruning
 
inventory management system
 inventory management system inventory management system
inventory management system
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine Learning
 
Speech emotion recognition
Speech emotion recognitionSpeech emotion recognition
Speech emotion recognition
 

Andere mochten auch (6)

internship1
internship1internship1
internship1
 
Machine Learning Project
Machine Learning ProjectMachine Learning Project
Machine Learning Project
 
In-plant Training Guidelines_SCSE
In-plant Training Guidelines_SCSEIn-plant Training Guidelines_SCSE
In-plant Training Guidelines_SCSE
 
Pradeep_Chaudhari_Report
Pradeep_Chaudhari_ReportPradeep_Chaudhari_Report
Pradeep_Chaudhari_Report
 
Internship Report
Internship ReportInternship Report
Internship Report
 
Summer internship project report
Summer internship project reportSummer internship project report
Summer internship project report
 

Ähnlich wie Internship project report,Predictive Modelling

Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
ssuser33da69
 
Final Report
Final ReportFinal Report
Final Report
imu409
 
Algorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docxAlgorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docx
daniahendric
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
Dinusha Dilanka
 

Ähnlich wie Internship project report,Predictive Modelling (20)

Unit 2-ML.pptx
Unit 2-ML.pptxUnit 2-ML.pptx
Unit 2-ML.pptx
 
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
 
Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & Prediction
 
IRJET- Machine Learning: Survey, Types and Challenges
IRJET- Machine Learning: Survey, Types and ChallengesIRJET- Machine Learning: Survey, Types and Challenges
IRJET- Machine Learning: Survey, Types and Challenges
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - Report
 
Real Estate Investment Advising Using Machine Learning
Real Estate Investment Advising Using Machine LearningReal Estate Investment Advising Using Machine Learning
Real Estate Investment Advising Using Machine Learning
 
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruption
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...
IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...
IRJET- Sentiment Analysis to Segregate Attributes using Machine Learning Tech...
 
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
 
Regression and Artificial Neural Network in R
Regression and Artificial Neural Network in RRegression and Artificial Neural Network in R
Regression and Artificial Neural Network in R
 
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data Mining
 
CUSTOMER CHURN PREDICTION
CUSTOMER CHURN PREDICTIONCUSTOMER CHURN PREDICTION
CUSTOMER CHURN PREDICTION
 
IRJET- Analysis of Music Recommendation System using Machine Learning Alg...
IRJET-  	  Analysis of Music Recommendation System using Machine Learning Alg...IRJET-  	  Analysis of Music Recommendation System using Machine Learning Alg...
IRJET- Analysis of Music Recommendation System using Machine Learning Alg...
 
Final Report
Final ReportFinal Report
Final Report
 
Algorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docxAlgorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docx
 
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree AlgorithmWater Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 

Mehr von Amit Kumar

Mehr von Amit Kumar (14)

Churn model for telecom
Churn model for telecomChurn model for telecom
Churn model for telecom
 
A case study on churn analysis1
A case study on churn analysis1A case study on churn analysis1
A case study on churn analysis1
 
A project report on declined sales of citrus
A project report on declined sales of citrusA project report on declined sales of citrus
A project report on declined sales of citrus
 
BI Approach
BI ApproachBI Approach
BI Approach
 
Big data analytics final
Big data analytics finalBig data analytics final
Big data analytics final
 
Designing of Dataware-house
Designing of Dataware-houseDesigning of Dataware-house
Designing of Dataware-house
 
Mc donalds part 2
Mc donalds part 2Mc donalds part 2
Mc donalds part 2
 
Mc donalds part 1
Mc donalds part 1Mc donalds part 1
Mc donalds part 1
 
Data analytics for marketing decision support
Data analytics for marketing decision supportData analytics for marketing decision support
Data analytics for marketing decision support
 
Planning & Budgeting for ITC retails
Planning & Budgeting for ITC retailsPlanning & Budgeting for ITC retails
Planning & Budgeting for ITC retails
 
Strategy to launch a basketball League
Strategy to launch a basketball LeagueStrategy to launch a basketball League
Strategy to launch a basketball League
 
A report on merger & acquisition of united
A report on merger & acquisition of unitedA report on merger & acquisition of united
A report on merger & acquisition of united
 
Market Research Project on small car segment
Market Research Project on small car segmentMarket Research Project on small car segment
Market Research Project on small car segment
 
Fsa presentation
Fsa presentationFsa presentation
Fsa presentation
 

Internship project report,Predictive Modelling

  • 1. Digit Recognizer (Machine Learning) Internship Project Report Submitted to: Persistent System Limited Product Engineering Unit-1 for Data Mining, Pune Submitted By:- Amit Kumar PGPBA Praxis Business School, Kolkata Project Mentor:-Mr. Yogesh Badhe (Technical Specialist, Product Engineering Unit-1) Start Date of Internship: 16th July 2012 End date of Internship: 15th October 2012 Report Date: 15th October 2012
  • 2. Preface This report documents is the work done during the summer internship at Persistent System Limited, Pune on the classify handwritten digits, under the supervision of Mr. Yohesh Badhe. The report will give an overview of the tasks completed during the period of internship with technical details. Then the results obtained are discussed and analyzed. I have tried my best to keep report simple yet technically correct. I hope I succeed in my attempt. Amit Kumar
  • 3. ACKNOWLEDGEMENT Simply, I could not have done this work without the lots of help I received cheerfully from Data Mining Team. The work culture in Persistent System Limited is really motivates.Everybody is such a friendly and cheerful companion here that work stress is never comes in way. I would specially like to thank Mr. Mukund Deshpande who gave me this project to learn and understand the business implication of statistical algorithm. Once again I would be thankful to Mr. Yogesh Badhe who helped me from the understanding of the project to the building the statistical model. He not only advised me in the project, but listened my arguments in our discussion. I am also very thankful to Ms. Deepti Takale who helped me lot to absorb the statistical concepts. Amit Kumar
  • 4. Abstract
The report presents the three tasks completed during the summer internship at Persistent System Limited, which are listed below:
1. Understanding the problem objective and its business implications
2. Understanding the data and building the model
3. Evaluating the model
All these tasks have been completed successfully, and the results were according to expectations. All the tasks needed a very systematic approach, starting from the behaviour of the data, through the application of the algorithm, to the evaluation of the model. The most challenging part was the domain knowledge needed to understand the behaviour of the data. Once the data had been prepared, we applied a statistical algorithm for model building. This is one of the major areas and really needs fundamental and conceptual knowledge of advanced statistics.
Amit Kumar
  • 5. Introduction:- This project is taken from Kaggle, a platform for predictive-modelling and analytics competitions. Organizations and researchers post their data there, and statisticians and data scientists from all over the world compete to produce the best models.
Problem Statement:- Each image is of a handwritten digit, 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single value associated with it, indicating the lightness or darkness of that pixel, with higher meaning darker. The pixel value is an integer between 0 and 255, inclusive. Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix (indexing from zero).
Goal of the Competition:- Take an image of a single handwritten digit and determine what that digit is.
Approach for the Model Building:-
Develop the Analysis Plan:- For establishing the conceptual model, we need to understand the selected techniques and the model-implementation issues. Here we will build a predictive model based on the random forest algorithm, with the model considerations based upon sample size and the required type of variables (metric versus nonmetric).
Evaluation of Underlying Assumptions:- All multivariate techniques rely on underlying assumptions, both statistical and conceptual, that substantially affect their ability to represent multivariate relationships. For techniques based on statistical inference, the assumptions of multivariate normality, linearity, independence of the error terms, and equality of variances in a dependence relationship must all be met. Since our data are categorical, we do not need to verify linearity or any independence relation.
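The pixel-index decomposition described in the problem statement (x = i * 28 + j) can be sketched as follows; the project's own code is in R, but this illustrative snippet is in Python:

```python
# Decompose a flat pixel index x (0..783) into the row i and column j
# of the 28 x 28 image, using x = i*28 + j.
def pixel_to_rowcol(x):
    i, j = divmod(x, 28)
    return i, j

print(pixel_to_rowcol(100))  # (3, 16), since 3*28 + 16 = 100
print(pixel_to_rowcol(783))  # (27, 27), the bottom-right pixel
```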
Estimate the Model and Assess Overall Model Fit:- With the assumptions satisfied, the analysis proceeds to the actual estimation of the model and an assessment of overall model fit. In the estimation process we can choose options to meet specific characteristics of the data or to maximize the fit to the data. After the model is estimated, the overall model fit is evaluated to ascertain whether it achieves acceptable levels on statistical criteria. Many times the model will be respecified in an attempt to achieve a better level of overall fit or explanation.
Interpret the Variate:- With an acceptable level of model fit, interpreting the variate reveals the nature of the relationship. The interpretation of effects for individual variables is made by examining the estimated weight for each variable in the variate.
Validate the Model:- Before accepting the results, we must subject them to one final set of diagnostic analyses that assess the generalizability of the results by the available validation methods. The attempt to validate the model is directed toward demonstrating the generalizability of the results to the total population. These diagnostic analyses add little to the interpretation of the results but can be viewed as insurance that the results are the most descriptive of the data.
  • 6. Required Statistical Concepts for this Project:-
Data Mining:- Also known as Knowledge Discovery in Databases, data mining is a field at the intersection of computer science and statistics that attempts to discover patterns in large data sets. It utilizes methods from artificial intelligence, machine learning, statistics and database systems. The overall goal of the data-mining process is to extract information from a data set and transform it into an understandable structure for future use.
Decision Tree:- A decision tree can be used to predict a pattern or to classify the class of a data point, and is commonly used in data mining. The goal of the decision-tree algorithm is to create a model that predicts the target variable based upon several input variables. In a decision tree, each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A tree can be learned by splitting the source set into subsets based on an attribute-value test. This process is repeated on each derived subset, which is called recursive partitioning. The general fashion is top-down induction. Decision trees used in data mining are of two main types:
1. Classification Tree:- when the predicted outcome is the class to which the data belongs.
2. Regression Tree:- when the predicted outcome can be considered a real number.
The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures. Some other techniques construct more than one decision tree, such as bagging, random forests, and boosted trees. We have used the random forest method for this project. The algorithms used for constructing decision trees usually work top-down by choosing, at each step, the variable that best splits the set of items. "Best" is defined by how well the variable splits the set into homogeneous subsets that have the same value of the target variable.
Different algorithms use different formulae for measuring "best"; these are the mathematical functions through which we measure impurity.
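As an illustration (not part of the original report), one widely used impurity measure is the Gini index, 1 - Σ p_k², where p_k is the proportion of class k in the node; a minimal Python sketch:

```python
from collections import Counter

# Gini impurity of a list of class labels: 1 minus the sum of squared
# class proportions. 0 means the node is pure; larger means more mixed.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini([0, 0, 0, 0]))  # 0.0, a pure node
print(gini([0, 1, 0, 1]))  # 0.5, a maximally mixed two-class node
```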
  • 7. Random Forest:- Recently there has been a lot of interest in "ensemble learning": methods that generate many classifiers and aggregate their results. Two well-known methods are boosting and bagging of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors; in the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees; each is independently constructed using a bootstrap sample of the data set, and in the end a simple majority vote is taken for prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting. In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values.
The Algorithm:- The random forests algorithm (for both classification and regression) is as follows:
1. Draw ntree bootstrap samples from the original data.
2. For each of the bootstrap samples, grow an unpruned classification or regression tree, with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample mtry of the predictors and choose the best split from among those variables.
(Bagging can be thought of as the special case of random forests obtained when mtry = p, the number of predictors.)
3. Predict new data by aggregating the predictions of the ntree trees (i.e., majority vote for classification, average for regression).
An estimate of the error rate can be obtained, based on the training data, as follows:
1. At each bootstrap iteration, predict the data not in the bootstrap sample (what Breiman calls "out-of-bag", or OOB, data) using the tree grown with the bootstrap sample.
2. Aggregate the OOB predictions. (On average, each data point is out-of-bag around 36% of the time, so aggregate these predictions.) Calculate the error rate, and call it the OOB estimate of the error rate.
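The "around 36%" figure follows from the probability that a given point is never drawn in n draws with replacement, (1 - 1/n)^n, which tends to e^(-1) ≈ 0.368 as n grows. A quick numerical check in Python (illustrative only):

```python
import math

# Probability that a specific point is left out of a bootstrap sample
# of size n drawn with replacement: (1 - 1/n)^n.
for n in (100, 1000, 10000):
    print(n, (1 - 1 / n) ** n)

# The limit as n grows is 1/e, roughly 0.368 (the ~36% OOB fraction).
print(math.exp(-1))
```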
  • 8. Source Code:-

# makes the random forest submission
library(randomForest)

# read the training and test data (the first column of train.csv is the label)
train <- read.csv("../data/train.csv", header=TRUE)
test <- read.csv("../data/test.csv", header=TRUE)

# separate the label from the 784 pixel columns
labels <- as.factor(train[,1])
train <- train[,-1]

# fit a random forest with 1000 trees and predict on the test set
rf <- randomForest(train, labels, xtest=test, ntree=1000)
predictions <- levels(labels)[rf$test$predicted]

# write one prediction per line
write(predictions, file="rf_benchmark.csv", ncolumns=1)
  • 9. Solution:- From the training data set we take five random samples (20 percent each) and build five different models. Since our data set is large, we combine the respective model results to validate the overall model accuracy. Our approach could vary; for example, we could build a model on 80 percent of the training data and keep 20 percent of the data to validate the model.
Model-1 (Random Forest Algorithm, sample=train.csv)
  • 10. OOB estimate of error rate:- 5.32%
Confusion matrix (rows = actual digit, columns = predicted digit):
      0     1     2     3     4     5     6     7     8     9   class.error
0   848     0     0     1     3     0     6     0     8     0   0.0207852
1     0   933     8     3     3     1     2     1     0     1   0.019958
2    10     2   765     6     4     0     5     7     9     2   0.0555556
3     2     3    20   812     3    18     2     9    17     4   0.0876405
4     2     1     0     0   787     0     6     3     0    23   0.0425791
5     7     2     1    20     0   687    13     0     4     7   0.0728745
6     9     2     1     0     5    12   762     0     2     0   0.0390921
7     1     6    12     1     6     2     1   870     5    14   0.0522876
8     2     6     8    16     4     8     2     3   744    11   0.0746269
9     5     4     1    11    16     3     2     7    10   745   0.0733831
(total observations: 8400)
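The class.error column in these tables is simply the off-diagonal fraction of each row; a small Python sketch (illustrative, using the digit-0 row of the table above):

```python
# class.error for one confusion-matrix row: misclassified count
# divided by the row total (the row's true class sits on the diagonal).
def class_error(row, true_class):
    total = sum(row)
    return (total - row[true_class]) / total

row0 = [848, 0, 0, 1, 3, 0, 6, 0, 8, 0]  # digit-0 row from Model-1
print(round(class_error(row0, 0), 7))  # 0.0207852, matching the table
```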
  • 12. OOB estimate of error:- 5.14%
Confusion matrix:
      0     1     2     3     4     5     6     7     8     9   class.error
0   773     0     1     0     2     0     7     0     5     0   0.0190355
1     0   946     3     5     2     1     1     2     3     0   0.0176532
2     3     1   766     4    11     1     6    11     6     1   0.054321
3     2     4    17   767     1    14     2     2    16     8   0.0792317
4     5     1     1     0   774     0     8     0     2    20   0.0456227
5     7     5     1    19     3   732     3     1     5     9   0.0675159
6     7     2     0     1     2     6   821     0     2     0   0.0237812
7     1     3    12     2     7     0     0   879     5    20   0.0538213
8     2     9     4     9     4    14     6     0   716    15   0.0808729
9     7     3     2    14    18     1     1    14     7   794   0.0778165
(total observations: 8400)
  • 13. Model-3 (Random Forest Algorithm, sample=train2.csv)
  • 14. OOB estimate of error:- 5.63%
Confusion matrix:
      0     1     2     3     4     5     6     7     8     9   class.error
0   864     0     2     0     1     1     4     0     7     1   0.0181818
1     0   896     6     2     4     2     1     3     4     2   0.026087
2     6     4   810     6     8     2     8    11     5     0   0.0581395
3     5     3    18   788     1    14     3    11    18     5   0.0900693
4     3     3     3     1   714     2     4     2     5    21   0.0580475
5     7     3     0    18     2   709     9     1     6     4   0.0658762
6     7     2     0     0     4     9   790     0     3     0   0.0306749
7     1     7    11     0     6     0     0   838     3    24   0.058427
8     4    11     5    18     4     9     3     3   745    13   0.0858896
9     8     5     2    14    14     2     0    10     9   773   0.0764636
(total observations: 8400)
  • 15. Model-4 (Random Forest Algorithm)
  • 16. OOB estimate of error:- 5.35%
Confusion matrix:
      0     1     2     3     4     5     6     7     8     9   class.error
0   778     0     0     0     0     4     3     0     4     0   0.0139417
1     0   956     6     1     2     1     2     2     0     1   0.015448
2     7     5   759    10     8     0     7    12    10     3   0.0755177
3     1     3    12   812     1    24     1     7     9     4   0.0709382
4     1     3     0     0   788     0     6     3     2    22   0.0448485
5     6     4     1    19     3   710     8     1     7     6   0.0718954
6     7     2     1     0     2     5   800     0     2     0   0.023199
7     1     4     9     2     7     0     0   830     0    26   0.0557452
8     2     8     3    14     2    10     6     2   746    15   0.0767327
9     5     0     7    15    24     2     2    11    11   772   0.0906949
(total observations: 8400)
  • 18. OOB estimate of error:- 5.30%
Confusion matrix:
      0     1     2     3     4     5     6     7     8     9   class.error
0   782     0     0     1     1     1     4     1     5     1   0.0175879
1     0   927     5     5     1     2     3     1     1     1   0.0200846
2     6     1   802     8     6     0     5    14     5     3   0.0564706
3     1     2     9   815     2    20     2     9    11     7   0.071754
4     1     3     1     0   776     0    12     2     5    21   0.0548112
5     6     4     2    12     4   696    10     0     5     6   0.0657718
6     6     1     5     0     3     8   796     0     2     0   0.0304507
7     1     9    14     1     7     1     0   825     4     8   0.0517241
8     0     8    11    15     8    11     7     0   727    17   0.0957711
9     5     2     0    12    12     4     1    11    13   809   0.0690449
(total observations: 8400)
  • 20. Overall Model Accuracy:-
OOB estimated error: 3.32%
Confusion matrix (rows = actual digit, columns = predicted digit):
       0      1      2      3      4      5      6      7      8      9   class.error
0   4074      0      2      1      3      4     18      1     16      0   0.010898523
1      0   4686     20      7      8      4     10      6      7      4   0.013888889
2     14     12   3992     19     26      7     18     32     25      6   0.038304023
3      3      3     44   4144      2     62      7     17     43     16   0.045381249
4      4      8      4      1   3916      0     16      6     10     72   0.029972752
5     21     14      2     30      5   3656     29      3     20     15   0.036627141
6     24      5      3      0      9     25   4017      0      6      0   0.017608217
7      7     18     37      7     20      1      0   4327      8     61   0.035443602
8      5     30     18     45     20     28     18      3   3811     32   0.049625935
9     20      5      9     47     51      6      2     31     20   4029   0.045260664
(total observations: 42000)
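Aggregating several models' predictions by majority vote (the same idea a random forest applies across its trees) can be sketched as follows; the prediction values here are hypothetical, for illustration only:

```python
from collections import Counter

# Combine per-model predictions with a simple majority vote:
# for each test case, take the most common predicted digit.
def majority_vote(per_model_preds):
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_model_preds)]

# Hypothetical predictions from five models on three test images:
preds = [[7, 2, 1], [7, 2, 1], [7, 3, 1], [9, 2, 1], [7, 2, 8]]
print(majority_vote(preds))  # [7, 2, 1]
```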
  • 21. Interpretation of the Model:- The prediction on the test data set is taken as the collective result of all five models, and a weighted average is taken to determine the best-fit result. Based upon the confusion matrices and the recursive pattern of the data set used to build the model, the average confidence is 0.954. We have avoided overfitting because in the random forest algorithm the data taken to build each model are random and unbiased. Interestingly, random forest can handle missing values and does not require pruning. In a random forest roughly 30-35% of the samples are not selected in a given bootstrap sample; these are the out-of-bag (OOB) samples. Using the OOB samples as input to the corresponding trees, predictions are made.
Bibliography:-
Multivariate Data Analysis by Hair, Black & Tatham
http://www.webchem.science.ru.nl:8080/PRiNS/rF.pdf
http://people.revoledu.com/kardi/tutorial/DecisionTree/index.html
http://www.statmethods.net/interface/workspace.html