ITB TERM PAPER




DATA MINING TECHNIQUES
(LINEAR MODELLING AND CLASSIFICATION)

RAHUL MAHAJAN (10BM60066)
Table of Contents
INTRODUCTION
    ABOUT WEKA
    ABOUT R
LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF FUTURE SHARE PRICE
    DATA
CASE 1
    THE CODE
    THE RESULT
    INTERPRETATION OF THE RESULT
CASE 2
    THE CODE
    THE RESULT
    INTERPRETATION OF THE RESULT
CLASSIFICATION
    THE DATASET
    CLASSIFICATION PROCEDURE
    INTERPRETING THE RESULTS




INTRODUCTION
In this term paper I demonstrate two data mining techniques:

       LINEAR MODELLING TECHNIQUE
           o The linear modelling technique is demonstrated using R.
       CLASSIFICATION
           o The classification technique is demonstrated using WEKA.


ABOUT WEKA
Weka is a Java-based, open-source collection of data mining and machine learning
algorithms, including
           o Pre-processing of data
           o Classification
           o Clustering
           o Association rule extraction


ABOUT R
R is an open-source programming language and software environment for statistical
computing and graphics. The R language is widely used among statisticians for developing
statistical software and for data analysis.




LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF
FUTURE SHARE PRICE

Here I try to predict the future share price with linear models in the spirit of GARCH models,
which describe a price series in terms of its previous values and its volatility over a defined
period. Many GARCH variants exist to give better estimates in different scenarios. The models
fitted below are ordinary linear regressions (R's lm function) on such inputs, not full GARCH fits.

Case 1 - Using the previous days' share prices and their standard deviation.
In this case tomorrow's price is expressed in terms of yesterday's price and the standard
deviation of the last 3 days' prices.

Case 2 - Using the previous day's share price and the gain of the previous day.

Share prices are generally believed to show momentum: they rise for a period of time and then
fall for a period. This model tries to take advantage of that behaviour.

Using these statistical techniques I compare the models developed in case 1 and case 2. My
working assumption is that the model developed in case 2 fits better than the model
developed in case 1.

DATA
Dr Devlina Chatterjee of VGSoM has purchased a large amount of data from NSE for her research,
and I have used a few files from it. In both cases I have used the February 2008 share price data
of Tata Motors. Except for the traded data, all of this data is available in the public domain.

The file contains the following items
   i)      Symbol,
   ii)     Series,
   iii)    Date,
   iv)     Prev Close,
   v)      Open Price,
   vi)     High Price,
   vii)    Low Price,
   viii)   Last Price,
   ix)     Close Price,
   x)      Average Price,
   xi)     Total Traded Quantity,
   xii)    Turnover in Lacs.


This text file is available at http://bit.ly/TM_PVD

CASE 1

The program first reads the file and extracts the price data. It creates three vectors, A, B and
C, holding the prices of three consecutive days. Then, using a for loop, it finds the standard
deviation of the prices over those 3 days. Finally, it fits a linear model to predict future
prices.

Before running the code, change R's working directory to the folder where the text file is
saved. All functions used here ship with a standard R installation, so no additional packages
are needed.
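
The setup step can be sketched as follows. This is only an illustration: the file name tatamotors_demo.txt, the temporary folder and the two rows of values are made up so that the snippet runs on its own; with the real data one would point data_dir at the folder holding the text file.

```r
# Hypothetical folder; replace with the folder that holds the real text file.
data_dir <- tempdir()
old_dir <- setwd(data_dir)   # setwd() returns the previous working directory

# Two made-up rows in the column order listed under DATA (not real NSE values).
writeLines(c(
  "TATAMOTORS EQ 01-Feb-2008 700.10 702.00 710.00 695.00 705.00 704.50 703.20 100000 7032.00",
  "TATAMOTORS EQ 04-Feb-2008 704.50 706.00 715.00 701.00 712.00 711.80 709.40 120000 8512.80"
), "tatamotors_demo.txt")

Trade <- read.table("tatamotors_demo.txt")   # no header row in this stand-in file
price <- Trade[, 4]                          # 4th column, as used in the code below
print(price)

setwd(old_dir)                               # restore the previous working directory
```

Restoring the old working directory at the end keeps the sketch from affecting the rest of the R session.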


THE CODE


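# Read the price file (no header row assumed, as in the plain read.table call)
TFile <- "tatamotors.txt"
Trade <- read.table(TFile)

P <- Trade[, 4]              # price series (4th column of the file)
n <- length(P)

A <- P[1:(n - 2)]            # first price of each 3-day window
B <- P[2:(n - 1)]            # second price of each window
C <- P[3:n]                  # third price of each window (the response)

# Standard deviation of the three prices in each window.
# D is pre-allocated so that every window's value is kept at its own index.
D <- numeric(length(A))
for (i in seq_along(A)) D[i] <- sd(c(A[i], B[i], C[i]), na.rm = FALSE)

summary(lm(C ~ A + D))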
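
The three-day windows can also be built without a loop. The sketch below is an alternative formulation, not the code that produced figure 1. It assumes synthetic prices (the price vector is invented for illustration) and uses base R's embed() to lay consecutive prices side by side.

```r
# Synthetic prices stand in for the Tata Motors file (hypothetical values).
set.seed(1)
price <- 700 + cumsum(rnorm(20, sd = 5))

# Each row of W holds three consecutive prices: (day t, day t-1, day t-2).
W <- embed(price, 3)
C <- W[, 1]                  # day t (the response)
A <- W[, 3]                  # day t-2
D <- apply(W, 1, sd)         # standard deviation of each 3-day window

fit <- lm(C ~ A + D)
print(summary(fit))
```

With 20 prices this gives 18 complete windows, so every row contributes to the regression.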



THE RESULT

The result of the above code is shown in figure 1.




Figure 1: The output of case 1




INTERPRETATION OF THE RESULT

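The p-values and the F-statistic show that the model is not able to predict the prices well:
none of the coefficients is significant (every p-value exceeds 0.3). The reported p-values
correspond to a t distribution with a single residual degree of freedom, so the fit rests on
very few usable observations.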

               Estimate      Std. Error     t value     Pr(>|t|)
(Intercept)    -1.891e+03    1.589e+03      -1.190      0.445
A               3.694e+00    2.257e+00       1.637      0.349
D              -9.471e-02    1.066e-01      -0.888      0.538



CASE 2

The program first reads the file and extracts the price data into vector A. Then, using vectors
B and C, it finds the gains for the first n-1 days (where n is the total number of days
available). These gains are stored in vector D. The linear model function then shows the
statistical significance of the model.

We expect the predictors in this case to be highly correlated, so we set the correlation flag
of summary() to TRUE. This additionally prints the correlation matrix of the estimated
coefficients. It does not change the fit itself or correct for autocorrelation in the data.


THE CODE


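# Read the price file (no header row assumed, as in case 1)
TFile <- "tatamotors.txt"
Trade <- read.table(TFile)

A <- Trade[, 4]            # price series (4th column of the file)
l <- length(A)
B <- A[-1]                 # prices for days 2..n (the response)
C <- A[-l]                 # prices for days 1..n-1 (the previous day)
D <- (C - B) * 100 / C     # day-over-day change in percent of C (positive when the price falls)

# correlation = TRUE also prints the correlation matrix of the coefficient estimates
summary(lm(B ~ C + D), correlation = TRUE)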
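
To see what the correlation flag contributes, the following sketch fits the same kind of model on synthetic prices (the price vector is invented; it is not the Tata Motors data) and compares the two summaries. It is meant only as an illustration of the flag.

```r
# Synthetic prices stand in for the real file (hypothetical values).
set.seed(2)
price <- 700 + cumsum(rnorm(30, sd = 5))
n <- length(price)

B <- price[-1]             # days 2..n
C <- price[-n]             # days 1..n-1
D <- (C - B) * 100 / C     # same definition as in the code above

fit <- lm(B ~ C + D)
s_plain <- summary(fit)
s_corr  <- summary(fit, correlation = TRUE)

# The coefficient table is the same either way ...
print(all.equal(coef(s_plain), coef(s_corr)))
# ... the flag only adds the correlation matrix of the coefficient estimates.
print(s_corr$correlation)
```

The estimates, standard errors and p-values are identical in both summaries; only the extra matrix differs.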


THE RESULT
The result of the above code is shown in figure 2.




Figure 2: The output of case 2

INTERPRETATION OF THE RESULT

               Estimate      Std. Error     t value      Pr(>|t|)
(Intercept)    -2.335684     2.392506       -0.976       0.333
C               1.002842     0.003261       307.518      <2e-16 ***
D              -7.187997     0.042478       -169.215     <2e-16 ***

Here the coefficients of C and D are highly significant, and the adjusted R-squared is high.
The high adjusted R-squared also reflects the strong autocorrelation of the price series, which
is very evident in this case; in addition, D is computed from both B and C, so part of the fit
follows directly from how D is defined. Even so, the F-statistic shows that this model predicts
share prices better.

So here we confirm our assumption that the previous-day gain model (case 2) fits better than
the standard deviation model (case 1).




CLASSIFICATION
Classification is demonstrated here using decision trees, one of the most common
classification techniques. A decision tree algorithm creates a set of rules to determine the
output of a new data instance.

It creates a tree in which each node represents an attribute of our dataset. A decision is made
at each node based on the input. By moving from one node to the next you reach the end of the
tree, which gives the predicted output.
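
The walk from the root to a leaf can be sketched in a few lines of R. The tree below is written by hand purely for illustration (its attributes, branches and predictions are hypothetical and are not the tree produced by WEKA later in this paper).

```r
# A hand-written tree (hypothetical). Internal nodes name an attribute and
# hold one child per attribute value; leaves are plain strings (the prediction).
tree <- list(
  attribute = "married",
  children = list(
    yes = list(
      attribute = "mortgage",
      children = list(yes = "no", no = "yes")
    ),
    no = "yes"
  )
)

# Follow the branch matching the instance until a leaf is reached.
predict_tree <- function(node, instance) {
  while (is.list(node)) {
    value <- instance[[node$attribute]]
    node <- node$children[[value]]
  }
  node
}

customer <- list(married = "yes", mortgage = "no")
print(predict_tree(tree, customer))
```

Each pass through the loop corresponds to one decision at a node; the loop stops as soon as the current node is a leaf.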

The technique is illustrated using the following example.

THE DATASET
The dataset used in this example was found on the internet. The data can be downloaded from
the link
http://maya.cs.depaul.edu/classes/ect584/weka/data/bank-data.csv

Let's say there is a bank ABC. It has data on 600 people who have either opted for its product
or not. For each person it holds information such as age, gender, income, marital status,
region and mortgage. The bank can use this information to create a rule that predicts, from
the known attributes of a new potential customer, whether that customer would opt for its
product.

CLASSIFICATION PROCEDURE
Load the data in WEKA: click on Open file and specify the path. The window shown in figure 3
should appear after loading.

Note that there are 12 attributes in the dataset, as seen in the attribute tab of the
window. For this example we will use only the following attributes:

Age, sex, region, income, married, mortgage, savings and product.

We will try to predict the response of a new customer (product) using the 7 attributes age,
sex, region, income, married, mortgage and savings.

To remove the remaining attributes, tick the checkbox to the left of each one and click
Remove. After removing them one should get the window shown in figure 4.

Now click on the Classify tab at the top. Under the Classify tab click on Choose > trees >
J48, as shown in figure 5.

J48 is WEKA's implementation of C4.5, a decision tree algorithm developed by Ross Quinlan as
an extension of his earlier ID3 algorithm. The decision trees it generates use the concept
of information entropy.
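
As a rough illustration of the entropy idea, the sketch below computes the entropy of a class label and the information gain of splitting on an attribute. The toy values are invented (they are not taken from bank-data.csv), and the sketch shows plain information gain; C4.5 itself uses a normalised variant, the gain ratio.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain from splitting `labels` by a categorical attribute.
info_gain <- function(labels, attribute) {
  groups <- split(labels, attribute)
  weighted <- sum(sapply(groups, function(g) length(g) / length(labels) * entropy(g)))
  entropy(labels) - weighted
}

# Toy data (hypothetical).
product  <- c("yes", "yes", "no",  "no",  "yes", "no",  "no",  "yes")
married  <- c("no",  "no",  "yes", "yes", "no",  "yes", "yes", "no")
mortgage <- c("yes", "no",  "yes", "no",  "yes", "no",  "yes", "no")

print(entropy(product))              # 1: classes are evenly split
print(info_gain(product, married))   # 1: married separates the classes completely
print(info_gain(product, mortgage))  # 0: mortgage tells us nothing here
```

An attribute with a higher gain is a better candidate for a node near the top of the tree, which is why married would be chosen over mortgage in this toy example.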



Now we can create the model in WEKA. First ensure that the training set option is selected, so
that only the data we have loaded is used to create the model. Click Start. The output from
this model should look like figure 6.

INTERPRETING THE RESULTS

The important results to focus on are

     1. "Correctly Classified Instances" (75.67 percent) and "Incorrectly Classified
        Instances" (24.33 percent), which tell us about the accuracy of the model. Our model
        is neither very good nor very bad; further modification is needed.
     2. The confusion matrix, which shows the number of false positives and false negatives.
        In this case 117 instances of a are incorrectly classified as b and 29 instances of b
        are incorrectly classified as a.
     3. The ROC area, which measures the discrimination ability of the forecast. Although
        there is some discrimination whenever the ROC area is > 0.5, in most situations the
        discrimination ability is not considered useful in practice unless the ROC area
        is > 0.7. For our model the ROC area is greater than 0.7 (0.787).
     4. The decision tree, which is the main output. It is the rule that will help to predict
        the outcome of new data instances. To view the decision tree, right-click on the
        model and select Visualize tree. You will get the window shown in figure 7.
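
The accuracy figures follow directly from the misclassification counts in the confusion matrix. A small sketch of the arithmetic, using only the numbers quoted above (600 instances, 117 and 29 misclassified); the variable names are chosen here for illustration:

```r
total  <- 600   # instances in the dataset
a_as_b <- 117   # class a incorrectly classified as b
b_as_a <- 29    # class b incorrectly classified as a

incorrect <- a_as_b + b_as_a
accuracy  <- (total - incorrect) / total

print(round(100 * accuracy, 2))           # 75.67
print(round(100 * incorrect / total, 2))  # 24.33
```

This gives about 75.67 and 24.33 percent, matching the figures in the list above up to rounding.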




Figure 3: Window after loading dataset

Figure 4: Window after removing unwanted attributes

Figure 5: Choosing the J48 tree

Figure 6: Output of the classification process

Figure 7: The decision tree

Weitere ähnliche Inhalte

Was ist angesagt?

Scalable frequent itemset mining using heterogeneous computing par apriori a...
Scalable frequent itemset mining using heterogeneous computing  par apriori a...Scalable frequent itemset mining using heterogeneous computing  par apriori a...
Scalable frequent itemset mining using heterogeneous computing par apriori a...
ijdpsjournal
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
kevinlan
 
Random forest algorithm for regression a beginner's guide
Random forest algorithm for regression   a beginner's guideRandom forest algorithm for regression   a beginner's guide
Random forest algorithm for regression a beginner's guide
prateek kumar
 

Was ist angesagt? (19)

Scalable frequent itemset mining using heterogeneous computing par apriori a...
Scalable frequent itemset mining using heterogeneous computing  par apriori a...Scalable frequent itemset mining using heterogeneous computing  par apriori a...
Scalable frequent itemset mining using heterogeneous computing par apriori a...
 
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetData Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Comparison of Text Classifiers on News Articles
Comparison of Text Classifiers on News ArticlesComparison of Text Classifiers on News Articles
Comparison of Text Classifiers on News Articles
 
Ijariie1129
Ijariie1129Ijariie1129
Ijariie1129
 
The D-basis Algorithm for Association Rules of High Confidence
The D-basis Algorithm for Association Rules of High ConfidenceThe D-basis Algorithm for Association Rules of High Confidence
The D-basis Algorithm for Association Rules of High Confidence
 
Introduction to dm and dw
Introduction to dm and dwIntroduction to dm and dw
Introduction to dm and dw
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
An Improved Frequent Itemset Generation Algorithm Based On Correspondence
An Improved Frequent Itemset Generation Algorithm Based On Correspondence An Improved Frequent Itemset Generation Algorithm Based On Correspondence
An Improved Frequent Itemset Generation Algorithm Based On Correspondence
 
Ijtra130516
Ijtra130516Ijtra130516
Ijtra130516
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahout
 
Data Structures problems 2006
Data Structures problems 2006Data Structures problems 2006
Data Structures problems 2006
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
 
Web Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering AnalysisWeb Based Fuzzy Clustering Analysis
Web Based Fuzzy Clustering Analysis
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
 
B0950814
B0950814B0950814
B0950814
 
Random forest algorithm for regression a beginner's guide
Random forest algorithm for regression   a beginner's guideRandom forest algorithm for regression   a beginner's guide
Random forest algorithm for regression a beginner's guide
 

Andere mochten auch

Mis term paper
Mis term paperMis term paper
Mis term paper
rahulsm27
 
1. pc workshop bij aba
1. pc workshop bij aba1. pc workshop bij aba
1. pc workshop bij aba
Ceesh
 
Mis term paper
Mis term paperMis term paper
Mis term paper
rahulsm27
 
ITB_BeGraphic
ITB_BeGraphicITB_BeGraphic
ITB_BeGraphic
rahulsm27
 
Mis term paper
Mis term paperMis term paper
Mis term paper
rahulsm27
 
2. veiligsurfenophetinternet rpc aba
2. veiligsurfenophetinternet rpc aba2. veiligsurfenophetinternet rpc aba
2. veiligsurfenophetinternet rpc aba
Ceesh
 
Mis term paper
Mis term paperMis term paper
Mis term paper
rahulsm27
 
3. email voor beginners rpc aba
3. email voor beginners rpc aba3. email voor beginners rpc aba
3. email voor beginners rpc aba
Ceesh
 

Andere mochten auch (17)

Mobile Strategy
Mobile StrategyMobile Strategy
Mobile Strategy
 
Mis term paper
Mis term paperMis term paper
Mis term paper
 
1. pc workshop bij aba
1. pc workshop bij aba1. pc workshop bij aba
1. pc workshop bij aba
 
Museumkompas
MuseumkompasMuseumkompas
Museumkompas
 
Pot presentatie 2.1
Pot presentatie 2.1Pot presentatie 2.1
Pot presentatie 2.1
 
Workshop social media Creative Connection
Workshop social media Creative ConnectionWorkshop social media Creative Connection
Workshop social media Creative Connection
 
Mis term paper
Mis term paperMis term paper
Mis term paper
 
Configuring participation: On how we involve people in design
Configuring participation: On how we involve people in designConfiguring participation: On how we involve people in design
Configuring participation: On how we involve people in design
 
ITB_BeGraphic
ITB_BeGraphicITB_BeGraphic
ITB_BeGraphic
 
Philadelphia
PhiladelphiaPhiladelphia
Philadelphia
 
Npc duidelijk
Npc duidelijkNpc duidelijk
Npc duidelijk
 
Mis term paper
Mis term paperMis term paper
Mis term paper
 
CSC8605 Session 1 (Slides 2)
CSC8605 Session 1 (Slides 2)CSC8605 Session 1 (Slides 2)
CSC8605 Session 1 (Slides 2)
 
2. veiligsurfenophetinternet rpc aba
2. veiligsurfenophetinternet rpc aba2. veiligsurfenophetinternet rpc aba
2. veiligsurfenophetinternet rpc aba
 
Mis term paper
Mis term paperMis term paper
Mis term paper
 
3. email voor beginners rpc aba
3. email voor beginners rpc aba3. email voor beginners rpc aba
3. email voor beginners rpc aba
 
Co2
Co2Co2
Co2
 

Ähnlich wie ITB Term Paper - 10BM60066

Paper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityPaper-Allstate-Claim-Severity
Paper-Allstate-Claim-Severity
Gon-soo Moon
 
Open06
Open06Open06
Open06
butest
 
Data mining seminar report
Data mining seminar reportData mining seminar report
Data mining seminar report
mayurik19
 
Rachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_reportRachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_report
Rachit Mishra
 
PorfolioReport
PorfolioReportPorfolioReport
PorfolioReport
Albert Chu
 

Ähnlich wie ITB Term Paper - 10BM60066 (20)

Paper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityPaper-Allstate-Claim-Severity
Paper-Allstate-Claim-Severity
 
Open06
Open06Open06
Open06
 
IRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET - Stock Market Prediction using Machine Learning AlgorithmIRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET - Stock Market Prediction using Machine Learning Algorithm
 
IRJET- Data Visualization and Stock Market and Prediction
IRJET- Data Visualization and Stock Market and PredictionIRJET- Data Visualization and Stock Market and Prediction
IRJET- Data Visualization and Stock Market and Prediction
 
The Validity of CNN to Time-Series Forecasting Problem
The Validity of CNN to Time-Series Forecasting ProblemThe Validity of CNN to Time-Series Forecasting Problem
The Validity of CNN to Time-Series Forecasting Problem
 
Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks
Applying Deep Learning to Enhance Momentum Trading Strategies in StocksApplying Deep Learning to Enhance Momentum Trading Strategies in Stocks
Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
 
Data mining seminar report
Data mining seminar reportData mining seminar report
Data mining seminar report
 
Data Mining _ Weka
Data Mining _ WekaData Mining _ Weka
Data Mining _ Weka
 
CAR EVALUATION DATABASE
CAR EVALUATION DATABASECAR EVALUATION DATABASE
CAR EVALUATION DATABASE
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
 
Smart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend AnalysisSmart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend Analysis
 
Rachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_reportRachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_report
 
Bitcoin Price Prediction Using LSTM
Bitcoin Price Prediction Using LSTMBitcoin Price Prediction Using LSTM
Bitcoin Price Prediction Using LSTM
 
Algorithmic Trading Deutsche Borse Public Dataset
Algorithmic Trading Deutsche Borse Public DatasetAlgorithmic Trading Deutsche Borse Public Dataset
Algorithmic Trading Deutsche Borse Public Dataset
 
Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather Conditions
 
PorfolioReport
PorfolioReportPorfolioReport
PorfolioReport
 
Opinion mining framework using proposed RB-bayes model for text classication
Opinion mining framework using proposed RB-bayes model for text classicationOpinion mining framework using proposed RB-bayes model for text classication
Opinion mining framework using proposed RB-bayes model for text classication
 
Telecom Churn Analysis
Telecom Churn AnalysisTelecom Churn Analysis
Telecom Churn Analysis
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Kürzlich hochgeladen (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

ITB Term Paper - 10BM60066

  • 1. ITB TERM PAPER DATA MINING TECHNIQUES (LINEAR MODELLING AND CLASSIFICATION) RAHUL MAHAJAN (10BM60066)
  • 2. Table of Contents INTRODUCTION ............................................................................................................................................. 3 ABOUT WEKA ............................................................................................................................................ 3 ABOUT R .................................................................................................................................................... 3 LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF FUTURE SHAR PRICE ..................................... 4 DATA ......................................................................................................................................................... 4 CASE 1 ........................................................................................................................................................... 5 THE CODE .................................................................................................................................................. 5 THE RESULT ............................................................................................................................................... 5 INTERPRETATION OF THE RESULT............................................................................................................. 6 CASE 2 ........................................................................................................................................................... 7 THE CODE .................................................................................................................................................. 7 THE RESULT ............................................................................................................................................... 
7 INTERPRETATION OF THE RESULT............................................................................................................. 9 CLASSIFICATION .......................................................................................................................................... 10 THE DATASET .......................................................................................................................................... 10 CLASSIFICATION PROCEDURE ................................................................................................................. 10 INTERPRETING THE RESULTS .................................................................................................................. 11 2
  • 3. INTRODUCTION In this term paper I have demonstrated two data mining techniques  LINEAR MODELLING TECHNIQUE o The linear modelling technique is demonstrated using R.  CLASSIFICATION o The classification technique is demonstrated using WEKA ABOUT WEKA Weka is java based collection of open source of many data mining and machine learning algorithms, including o Pre-processing on data o Classification: o Clustering o Association rule extraction ABOUT R R is an open source programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software and data analysis 3
  • 4. LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF FUTURE SHAR PRICE Here I will try to use GARCH model to predict future share price. GARCH Models gives us liberty to define model using previous share prices and volatility for a defined period. There are many versions of GARCH Models to give better estimate in different scenarios. Case 1 - Using previous day share prices and standard deviation. In the example explained in this term paper the expression of tomorrow’s price is dependent on yesterday’s prices and standard deviation of last 3 days. Case 2 – Using previous day share price and gain of previous day. It is generally known that share prices behave in momentum basis, for a period of time share prices go up, then comes a period when prices goes down. This model takes advantage of this behaviour of the stock prices. So using the statistical techniques I will try to compare the model developed using case 1 and case 2. It is widely accepted that the model developed using case 2 fits better than model developed using case 1 DATA Dr Devlina Chatterjee of VGSoM has purchased lots of data from NSE for her research. I have used few files from her data. In both the cases I have used February 2008 share price data of Tata Motors. Except the traded data rest all data is available in public domain. The file contains the following items i) Symbol, ii) Series, iii) Date, iv) Prev Close, v) Open Price, vi) High Price, vii) Low Price, viii) Last Price, ix) Close Price, x) Average Price, xi) Total Traded xii) Quantity, xiii) Turnover in Lacs, This text file is available at this link- http://bit.ly/TM_PVD 4
  • 5. CASE 1
The program first reads the file and extracts the price data. It then creates the vectors A, B and C, which hold the prices on three consecutive days. Using a for loop it finds the standard deviation of the prices in each 3-day window, and then it fits a linear model to predict future prices.
Before running this case, change R's working directory to the folder where the text file is saved. Everything the code needs ships with R, so no additional packages have to be installed.
THE CODE
    TFile <- "tatamotors.txt"
    Trade <- read.table(TFile)
    A <- Trade[, 4]          # price column
    B <- A[-1]               # drop the first day
    C <- B[-1]               # drop the first two days
    l <- length(B)
    B <- B[-l]               # trim B to the length of C
    l <- length(A)
    A <- A[-l]
    l <- length(A)
    A <- A[-l]               # trim A to the length of C
    l <- length(A)
    # Preallocate D so that every window's standard deviation is kept;
    # assigning D inside sd() would overwrite it on each pass of the loop.
    D <- numeric(l)
    for (i in 1:l) D[i] <- sd(c(A[i], B[i], C[i]), na.rm = FALSE)
    summary(lm(C ~ A + D))
THE RESULT
The output for case 1 is shown in figure 1.
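The windowing step described for case 1 can also be written without a loop. The following is only a minimal sketch on made-up prices: the vector `prices` is hypothetical and stands in for the Tata Motors price column. It uses base R's `embed` to line up each day with the two days before it.

```r
# Hypothetical closing prices; NOT the Tata Motors data used in the paper.
prices <- c(700, 705, 698, 710, 715, 709, 720, 718)

# embed(x, 3) returns one row per 3-day window, newest day first:
# column 1 = day t, column 2 = day t-1, column 3 = day t-2.
W <- embed(prices, 3)

C <- W[, 1]            # price on day t (the response in case 1)
A <- W[, 3]            # price on day t-2 (the vector called A in case 1)
D <- apply(W, 1, sd)   # standard deviation of each 3-day window

# One standard deviation per window, with no missing values.
stopifnot(length(D) == length(prices) - 2, !anyNA(D))

fit <- lm(C ~ A + D)
print(coef(fit))
```

Checking that D has one value per window and no missing entries is a cheap way to confirm that every window actually enters the regression.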
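  • 6. Figure 1 The output of case 1
INTERPRETATION OF THE RESULT
The p-values and the F statistic show that the model is not able to predict the prices well; none of the coefficients is significant at the usual 5% level.
               Estimate     Std. Error   t value   Pr(>|t|)
(Intercept)   -1.891e+03    1.589e+03    -1.190     0.445
A              3.694e+00    2.257e+00     1.637     0.349
D             -9.471e-02    1.066e-01    -0.888     0.538
The p-values in this table are those of a t distribution with a single residual degree of freedom, so the fit shown rests on very few usable observations.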
  • 7. CASE 2
The program first reads the file and extracts the price data into vector A. Then, using vectors B and C, it finds the gains for the first n-1 days (where n is the total number of days available); this data is stored in vector D. The linear model function is then used to assess the statistical significance of the model. We expect the correlation to be high in this case, so we set the correlation flag of summary to TRUE, which additionally prints the correlation matrix of the estimated coefficients.
THE CODE
    TFile <- "tatamotors.txt"
    Trade <- read.table(TFile)
    A <- Trade[, 4]           # price column
    l <- length(A)
    B <- A[-1]                # prices on days 2..n
    C <- A[-l]                # prices on days 1..(n-1)
    D <- (C - B) * 100 / C    # percentage fall from C to B (negative on a gain)
    summary(lm(B ~ C + D), correlation = TRUE)
THE RESULT
The output for case 2 is shown in figure 2.
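As a hedged illustration of what the `correlation` argument of `summary` does for a linear model, here is a sketch on made-up prices (the `prices` vector is hypothetical, not the NSE file). With `correlation = TRUE`, the summary object carries the correlation matrix of the coefficient estimates in its `correlation` component. The sketch also checks how B relates to C and D as they are built in the case 2 code.

```r
# Hypothetical closing prices; NOT the Tata Motors data used in the paper.
prices <- c(700, 705, 698, 710, 715, 709, 720, 718, 725, 721)

n <- length(prices)
B <- prices[-1]          # prices on days 2..n
C <- prices[-n]          # prices on days 1..(n-1)
D <- (C - B) * 100 / C   # same construction as in the case 2 code

# B is an exact function of C and D: B = C * (1 - D/100).
stopifnot(isTRUE(all.equal(B, C * (1 - D / 100))))

s <- summary(lm(B ~ C + D), correlation = TRUE)

# A 3 x 3 correlation matrix of the estimated coefficients
# (intercept, C, D); it describes the estimates, not the residuals.
print(s$correlation)
```

The flag changes what is reported, not how the model is fitted, so the coefficient estimates are the same with or without it.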
  • 8. Figure 2 The output of case 2
  • 9. INTERPRETATION OF THE RESULT
               Estimate    Std. Error   t value    Pr(>|t|)
(Intercept)   -2.335684    2.392506      -0.976    0.333
C              1.002842    0.003261     307.518    <2e-16 ***
D             -7.187997    0.042478    -169.215    <2e-16 ***
Here we see that the significance of the model is very high, and the adjusted R-squared is high as well. However, the high adjusted R-squared also reflects autocorrelation in the prices, which is very evident in this case. Note also that D is computed from both B and C (B = C(1 - D/100)), so part of this fit holds by construction. Even so, the F-statistic shows that this model predicts share prices better. This confirms our assumption that the previous-day gain model (case 2) fits better than the standard deviation model (case 1) in R.
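The quantities used in this comparison (adjusted R-squared and the F-statistic) can be read directly from the summary object of a fitted linear model. The following is a minimal sketch on simulated data; the variables `x`, `noise` and `y` are hypothetical and unrelated to the share price file.

```r
set.seed(1)
x     <- 1:30
noise <- rnorm(30)
y     <- 2 * x + rnorm(30)   # y depends on x, not on noise

s_good <- summary(lm(y ~ x))       # informative predictor
s_poor <- summary(lm(y ~ noise))   # uninformative predictor

# Adjusted R-squared of each model.
print(c(good = s_good$adj.r.squared, poor = s_poor$adj.r.squared))

# fstatistic holds the F value and its two degrees of freedom;
# the overall p-value follows from the F distribution.
f <- s_good$fstatistic
p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(unname(p))
```

Comparing these numbers for two fits on the same response is the same kind of comparison made between case 1 and case 2.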
  • 10. CLASSIFICATION
The classification technique demonstrated here is the decision tree. It is an algorithm that creates a rule to determine the output for a new data instance. It builds a tree in which each node represents an attribute of our dataset, and a decision is made at each node based on the input. By moving from one node to the next you reach the end of the tree, which gives the predicted output. This is illustrated using the following example.
THE DATASET
The dataset used in this example was found on the net and can be downloaded from http://maya.cs.depaul.edu/classes/ect584/weka/data/bank-data.csv. Let's say there is a bank ABC. It has data on 600 people who have either opted for its product or not. It has the following information about these people: age, gender, income, marital status, region and mortgage. The bank can use this information to create a rule that predicts, from the known attributes of a new potential customer, whether that customer would opt for its product.
CLASSIFICATION PROCEDURE
Load the data in WEKA: click on Open file and specify the path. The window shown in figure 3 should appear after loading. Note that there are 12 attributes in the dataset, as seen in the attribute tab of the window. For this example we use only the following attributes: age, sex, region, income, married, mortgage, savings and product. We will try to predict the response of a new customer (product) using the 7 attributes age, sex, region, income, married, mortgage and savings.
To remove the remaining attributes, tick the checkbox to the left of each of them and click on Remove. After removing the attributes you should get the window shown in figure 4.
Now click on the Classify tab at the top. Under the Classify tab click on Choose -> trees -> J48, as shown in figure 5. J48 is WEKA's implementation of C4.5, an algorithm for generating decision trees developed by Ross Quinlan as an extension of his earlier ID3 algorithm. The decision trees generated by J48 use the concept of information entropy.
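To make the idea of information entropy concrete, here is a minimal sketch in R. It is not J48 itself: it computes plain information gain as used by ID3, whereas C4.5 refines this criterion with a gain ratio. The helper functions `entropy` and `info_gain` and the toy vectors are hypothetical and are not taken from the bank dataset.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain from splitting `labels` by the values of `attribute`.
info_gain <- function(labels, attribute) {
  groups <- split(labels, attribute)
  weighted <- sum(sapply(groups, function(g) {
    length(g) / length(labels) * entropy(g)
  }))
  entropy(labels) - weighted
}

# Hypothetical toy data: 8 customers and two yes/no attributes.
bought   <- c("yes", "yes", "yes", "yes", "no", "no", "no", "no")
married  <- c("yes", "yes", "yes", "yes", "no", "no", "no", "no")  # separates perfectly
mortgage <- c("yes", "no", "yes", "no", "yes", "no", "yes", "no")  # carries no information

print(info_gain(bought, married))    # 1 bit
print(info_gain(bought, mortgage))   # 0 bits
```

A tree builder of this family would split first on the attribute with the larger gain, here `married`.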
  • 11. Now we can create the model in WEKA. First ensure that the training set option is selected, so that only the data we have loaded is used for creating the model. Click Start. The output from this model will look as shown in figure 6.
INTERPRETING THE RESULTS
The important results to focus on are:
1. "Correctly Classified Instances" (75.67 percent) and "Incorrectly Classified Instances" (24.33 percent), which tell us about the accuracy of the model. Our model is neither very good nor very bad; it is acceptable, but further modification is needed.
2. The confusion matrix, which shows the number of false positives and false negatives. In this case 117 instances of a are incorrectly classified as b, and 29 instances of b are incorrectly classified as a.
3. The ROC area, which measures the discrimination ability of the forecast. Although there is some discrimination whenever the ROC area is > 0.5, in most situations the discrimination ability of the forecast is not considered useful in practice unless the ROC area is > 0.7. For our model the ROC area is greater than 0.7 (0.787).
4. The decision tree, which is the main output. It is the rule that will help to predict the outcome of new data instances. To view the decision tree, right-click on the model and select Visualize tree. You will get the window shown in figure 7.
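The accuracy figures in point 1 follow directly from the confusion matrix counts in point 2. A small sketch of the arithmetic, using only the numbers stated above (600 instances, 117 and 29 misclassifications); the variable names are illustrative:

```r
total  <- 600   # instances in the dataset
a_as_b <- 117   # instances of a classified as b
b_as_a <- 29    # instances of b classified as a

errors   <- a_as_b + b_as_a
accuracy <- (total - errors) / total

print(round(100 * accuracy, 2))         # 75.67
print(round(100 * errors / total, 2))   # 24.33
```

The two percentages always sum to 100, so either one is enough to judge the overall accuracy.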
  • 12. Figure 3 Window after loading dataset
Figure 4 Window after removing unwanted attributes
  • 13. Figure 5 Choosing the J48 tree
Figure 6 Output of the classification process