ITB TERM PAPER




DATA MINING TECHNIQUES
(LINEAR MODELLING AND CLASSIFICATION)

RAHUL MAHAJAN (10BM60066)
Table of Contents
INTRODUCTION
    ABOUT WEKA
    ABOUT R
LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF FUTURE SHARE PRICE
    DATA
CASE 1
    THE CODE
    THE RESULT
    INTERPRETATION OF THE RESULT
CASE 2
    THE CODE
    THE RESULT
    INTERPRETATION OF THE RESULT
CLASSIFICATION
    THE DATASET
    CLASSIFICATION PROCEDURE
    INTERPRETING THE RESULTS




INTRODUCTION
In this term paper I demonstrate two data mining techniques:

       LINEAR MODELLING TECHNIQUE
           o The linear modelling technique is demonstrated using R.
       CLASSIFICATION
           o The classification technique is demonstrated using WEKA.


ABOUT WEKA
Weka is a Java-based, open-source collection of data mining and machine learning
algorithms, including
           o Data pre-processing
           o Classification
           o Clustering
           o Association rule extraction


ABOUT R
R is an open-source programming language and software environment for statistical
computing and graphics. The R language is widely used among statisticians for developing
statistical software and for data analysis.




LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF
FUTURE SHARE PRICE

Here I try to predict future share prices with a GARCH-style approach, in which the model is
defined from previous share prices and volatility over a chosen period. There are many versions
of GARCH models that give better estimates in different scenarios. The two cases below are
fitted as ordinary linear regressions with R's lm function rather than with a dedicated GARCH
estimator.

Case 1 - Using previous share prices and standard deviation.
In the example explained in this term paper, tomorrow's price is expressed in terms of an
earlier price and the standard deviation of the prices over 3 days.

Case 2 - Using the previous day's share price and the gain of the previous day.

Share prices are generally believed to move with momentum: prices rise for a period of time,
and then a period follows in which prices fall. This model tries to take advantage of this
behaviour of stock prices.

Using statistical techniques, I compare the models developed in case 1 and case 2. The
working assumption is that the model developed in case 2 fits better than the model
developed in case 1.

DATA
Dr Devlina Chatterjee of VGSoM has purchased a large amount of data from NSE for her
research, and I have used a few files from it. In both cases I have used the February 2008 share
price data of Tata Motors. Except for the traded data, all of this data is available in the public domain.

The file contains the following items:
   i)      Symbol,
   ii)     Series,
   iii)    Date,
   iv)     Prev Close,
   v)      Open Price,
   vi)     High Price,
   vii)    Low Price,
   viii)   Last Price,
   ix)     Close Price,
   x)      Average Price,
   xi)     Total Traded Quantity,
   xii)    Turnover in Lacs


This text file is available at this link- http://bit.ly/TM_PVD

CASE 1

The program first reads the file and extracts the price data. It creates three aligned vectors,
A, B and C, holding the prices of 3 consecutive days. A for loop then computes the standard
deviation of each set of 3 prices and stores it in D. Finally, a linear model is fitted to predict
future prices.

Before running the code, set R's working directory to the folder where the text file is saved.
The code uses only functions that ship with a standard R installation, so no additional packages
need to be installed.
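The working-directory step can be sketched as follows. This is only an illustration: the temporary folder and the placeholder file contents are assumptions standing in for the real location of the data file.

```r
# Stand-in for the folder that holds the data file (an assumption for this sketch)
data_dir <- tempdir()
writeLines("placeholder", file.path(data_dir, "tatamotors.txt"))

# setwd() switches the working directory and returns the previous one
old_dir <- setwd(data_dir)

# Relative file names now resolve inside data_dir
found <- file.exists("tatamotors.txt")

# Restore the original working directory
setwd(old_dir)

print(found)
```

Checking file.exists() before calling read.table() gives a clearer failure than the read error that appears when the working directory is wrong.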


THE CODE


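# Read the share price data (the file must be in R's working directory)
TFile <- "tatamotors.txt"
Trade <- read.table(TFile)

# Column 4 holds the price series
P <- Trade[, 4]
n <- length(P)

# Align three consecutive days: A = day t-2, B = day t-1, C = day t
A <- P[1:(n - 2)]
B <- P[2:(n - 1)]
C <- P[3:n]
l <- length(A)

# D[i] is the standard deviation of the three prices A[i], B[i], C[i].
# D is allocated before the loop so that no entry is overwritten.
D <- numeric(l)
for (i in 1:l) D[i] <- sd(c(A[i], B[i], C[i]))

# Regress the latest price on the earliest price and the volatility term
summary(lm(C ~ A + D))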
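The alignment of A, B and C and the row-wise standard deviation can also be written without an explicit loop. The following is a minimal sketch using base R's embed(); the short price vector is synthetic and only stands in for the real series.

```r
# Synthetic prices stand in for the real series (an assumption for illustration)
P <- c(100, 102, 101, 105, 107, 106)

# embed(P, 3) returns one row per day t = 3..n,
# with P[t], P[t-1], P[t-2] in columns 1, 2, 3
m <- embed(P, 3)
C <- m[, 1]   # price on day t
B <- m[, 2]   # price on day t-1
A <- m[, 3]   # price on day t-2

# Standard deviation of each row, i.e. of three consecutive prices
D <- apply(m, 1, sd)

print(data.frame(A, B, C, D))
```

Because every element of D is computed in one call, there is no partially filled vector and every row is available to lm().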



THE RESULT

The result of the above code can be found on the following page (figure 1)




Figure 1 The output of case 1




INTERPRETATION OF THE RESULT

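The p-values and the F statistic show that the model is not able to predict the prices well:
no coefficient is close to significant at the usual 5% level.

                   Estimate     Std. Error   t value        Pr(>|t|)
(Intercept)        -1.891e+03   1.589e+03    -1.190         0.445
A                  3.694e+00    2.257e+00    1.637          0.349
D                  -9.471e-02   1.066e-01    -0.888         0.538

These p-values are what a t distribution with a single residual degree of freedom gives for the
listed t values, so the fit shown rests on only four complete observations. A sample that small
cannot establish whether A or D has any predictive value.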
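The Pr(>|t|) column can be cross-checked from the t values with the t distribution function pt(). In this sketch the choice df = 1 is an inference from the numbers in the table, not something stated in the source.

```r
# t values copied from the coefficient table above
t_values <- c(Intercept = -1.190, A = 1.637, D = -0.888)

# Two-sided p-value for a t statistic with `df` residual degrees of freedom
p_two_sided <- function(t, df) 2 * pt(-abs(t), df)

# With one residual degree of freedom the table's p-values are reproduced
p_values <- p_two_sided(t_values, df = 1)
print(round(p_values, 3))
```

With three estimated coefficients, one residual degree of freedom corresponds to four observations in the fit.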



CASE 2

The program first reads the file and extracts the price data into vector A. Using vectors B
and C, it then computes the percentage change between consecutive days for the first n-1 days
(where n is the total number of days available). This data is stored in vector D. The linear
model function then shows the statistical significance of the model.

We expect the correlation in this case to be high, so we set the correlation flag of the linear
model summary to TRUE. This prints the correlation matrix of the estimated coefficients
alongside the usual output; it does not change the fitted model or remove autocorrelation
from the data.
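What the correlation flag does can be checked on a small synthetic example. The data and variable names below are assumptions made for illustration; only lm() and summary() with its correlation argument come from the code in this paper.

```r
# Synthetic data with two deliberately correlated predictors (an assumption)
set.seed(1)
x1 <- rnorm(30)
x2 <- x1 + rnorm(30, sd = 0.1)
y  <- 2 + x1 - x2 + rnorm(30)

fit <- lm(y ~ x1 + x2)
s_plain <- summary(fit)
s_corr  <- summary(fit, correlation = TRUE)

# The coefficient table is the same with or without the flag
same_coefficients <- isTRUE(all.equal(coef(s_plain), coef(s_corr)))

# The flag only adds the correlation matrix of the estimated coefficients
has_corr_matrix <- !is.null(s_corr$correlation)

print(same_coefficients)
print(has_corr_matrix)
```

Both checks come out TRUE: the estimates, standard errors and p-values are unchanged, and the summary only gains an extra correlation component.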


THE CODE


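# Read the share price data (the file must be in R's working directory)
TFile <- "tatamotors.txt"
Trade <- read.table(TFile)

A <- Trade[, 4]          # price series
l <- length(A)
B <- A[-1]               # prices on days 2..n (the response)
C <- A[-l]               # prices on days 1..n-1 (the previous day's price)

# Percentage change from C to B, positive when the price falls.
# D is built from B itself, so B = C * (1 - D / 100) holds exactly.
D <- (C - B) * 100 / C

# correlation = TRUE also prints the correlation matrix of the coefficients
summary(lm(B ~ C + D), correlation = TRUE)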
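In the code above, D is computed from B, the same price that the model predicts. A variant that uses only information available before the predicted day lags the gain by one day. The following is a sketch under stated assumptions: the random-walk prices are synthetic, and the names target, prev, prev_prev and prev_gain are introduced here for illustration.

```r
# Synthetic random-walk prices stand in for the real series (an assumption)
set.seed(42)
P <- 700 + cumsum(rnorm(60))
n <- length(P)

target    <- P[3:n]          # price on day t (the response)
prev      <- P[2:(n - 1)]    # price on day t-1
prev_prev <- P[1:(n - 2)]    # price on day t-2

# Percentage gain realised on day t-1, which is known before day t
prev_gain <- (prev - prev_prev) * 100 / prev_prev

fit <- lm(target ~ prev + prev_gain)
print(summary(fit))
```

With the gain lagged in this way, the response no longer appears inside its own predictor, so the significance of prev_gain reflects momentum in the data rather than an algebraic identity.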


THE RESULT
The result of the above code can be found on the following page (figure 2)




Figure 2 The output of case 2




INTERPRETATION OF THE RESULT

               Estimate      Std. Error     t value        Pr(>|t|)
(Intercept)    -2.335684     2.392506       -0.976          0.333
C              1.002842      0.003261       307.518        <2e-16 ***
D              -7.187997     0.042478       -169.215       <2e-16 ***

Here the coefficients of C and D are highly significant, and the adjusted R-squared is high.
However, a high adjusted R-squared is also typical of regressions on strongly autocorrelated
series such as daily prices, which is evident in this case. In addition, D is computed from B
itself (B = C * (1 - D / 100) holds exactly), so much of this fit reflects that identity rather than
predictive power. With that caveat, the F-statistic shows that this model fits the share prices
far better than the case 1 model.

So these outputs agree with our assumption that the previous-day gain model (case 2) fits
better than the standard deviation model (case 1).




CLASSIFICATION
Classification assigns a new data instance to one of a set of known classes. A decision tree,
the technique used here, is an algorithm that creates a set of rules to determine the output of a
new data instance.

It creates a tree in which each internal node represents an attribute of our dataset. A decision
is made at each of these nodes based on the input. Moving from one node to the next, you
reach the end of the tree (a leaf), which gives the predicted output.
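The walk from the root of a tree to a leaf can be sketched in a few lines of R. The tree below is hand-built for illustration only: its attributes, thresholds and labels are hypothetical and are not the tree that Weka produces for this dataset.

```r
# A hand-built tree; attributes, thresholds and labels are hypothetical
tree <- list(
  attribute = "income", threshold = 30000,
  left  = list(label = "NO"),
  right = list(
    attribute = "age", threshold = 40,
    left  = list(label = "NO"),
    right = list(label = "YES")
  )
)

# Start at the root and follow one branch per node until a leaf is reached
predict_tree <- function(node, instance) {
  while (is.null(node[["label"]])) {
    go_left <- instance[[node[["attribute"]]]] < node[["threshold"]]
    node <- if (go_left) node[["left"]] else node[["right"]]
  }
  node[["label"]]
}

new_customer <- list(age = 45, income = 52000)
print(predict_tree(tree, new_customer))
```

Each internal node tests a single attribute, so a prediction needs at most as many comparisons as the tree is deep.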

This is illustrated using the following example

THE DATASET
The dataset used in this example was found on the internet. It can be downloaded from
http://maya.cs.depaul.edu/classes/ect584/weka/data/bank-data.csv.

Let's say there is a bank ABC. It has data on 600 people who have either opted for its product
or not. It has the following information about these people: age, gender, income, marital status,
region and mortgage. The bank can use this information to create a rule that predicts whether a
new potential customer would opt for its product, based on the known attributes of the
customer.

CLASSIFICATION PROCEDURE
Load the data in Weka: click Open file and specify the path. The window shown
in figure 3 should appear after loading.

There are 12 attributes in the dataset, as seen in the Attributes panel of the
window. For this example we will use only the following attributes:

Age, sex, region, income, married, mortgage, savings and product.

We will try to predict the response of a new customer (product) using the 7 attributes age,
sex, region, income, married, mortgage and savings.

To remove the remaining attributes, tick the checkbox to the left of each one and
click Remove. After removing the attributes you should get the window shown in figure 4.

Now click on the Classify tab at the top. Under the Classify tab click Choose > trees >
J48, as shown in figure 5.

J48 is Weka's implementation of C4.5, an algorithm developed by Ross Quinlan to generate
decision trees. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees it
generates use the concept of information entropy.
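How information entropy guides the choice of a split can be sketched as follows. The two six-element vectors are a made-up sample, not rows from the bank dataset, and the code shows plain information gain (the ID3 criterion); C4.5 additionally normalises it to a gain ratio, which is not shown here.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain from splitting `labels` by a categorical attribute:
# entropy before the split minus the size-weighted entropy of the groups
info_gain <- function(labels, attribute) {
  groups <- split(labels, attribute)
  weighted <- sum(sapply(groups, function(g) {
    length(g) / length(labels) * entropy(g)
  }))
  entropy(labels) - weighted
}

# Tiny hypothetical sample
product <- c("YES", "YES", "NO", "NO", "YES", "NO")
married <- c("YES", "YES", "NO", "NO", "NO", "YES")

print(entropy(product))
print(info_gain(product, married))
```

At each node, the tree builder evaluates the candidate attributes in this way and splits on the one that reduces entropy the most.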



Now we can create the model in Weka. First ensure that Use training set is selected under the
test options, so that the data we have loaded is used for creating the model. Click Start. The
output from this model will look as shown in figure 6.

INTERPRETING THE RESULTS

The important results to focus on are:

     1. "Correctly Classified Instances" (75.67 percent) and "Incorrectly Classified Instances"
        (24.33 percent), which tell us about the accuracy of the model. Our model is neither very
        good nor very bad; further modification is needed.
     2. The confusion matrix, which shows the number of false positives and false negatives. In this
        case 117 instances of a are incorrectly classified as b, and 29 instances of b are incorrectly
        classified as a.
     3. The ROC area, which measures the discrimination ability of the forecast. Although there is
        some discrimination whenever the ROC area is > 0.5, in most situations the
        discrimination ability of the forecast is not considered useful in practice unless the
        ROC area is > 0.7. For our model the ROC area is greater than 0.7 (0.787).
     4. The decision tree, which is the main output. It is the rule that will help to predict the outcome
        of new data instances. To view the decision tree, right-click on the model in the result list
        and select Visualize tree. You will get the window shown in figure 7.
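The accuracy figures follow directly from the misclassification counts, and the ROC area can be computed from ranked scores. In the sketch below, the counts are the ones reported above; the scores and labels passed to roc_area are hypothetical, because Weka's per-instance scores are not reproduced in this paper.

```r
# Counts reported in the text
n_total <- 600
a_as_b  <- 117   # a instances classified as b
b_as_a  <- 29    # b instances classified as a

n_wrong  <- a_as_b + b_as_a
accuracy <- (n_total - n_wrong) / n_total
print(round(100 * accuracy, 2))            # percent correctly classified
print(round(100 * n_wrong / n_total, 2))   # percent incorrectly classified

# ROC area from the rank-sum of the positive instances: the probability
# that a random positive is scored higher than a random negative
roc_area <- function(scores, is_positive) {
  r <- rank(scores)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores: higher means "more likely positive"
scores      <- c(0.9, 0.8, 0.35, 0.6, 0.2)
is_positive <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
print(roc_area(scores, is_positive))
```

An ROC area of 0.5 corresponds to random ordering and 1 to perfect separation, which is why values above 0.7 are treated as a practical threshold.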




Figure 3 Window after loading dataset




Figure 4 Window after removing unwanted attributes


Figure 5 Choosing the J48 tree




Figure 6 Output of the classification process

Figure 7 The decision tree




New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Kürzlich hochgeladen (20)

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

ITB Term Paper - 10BM60066

  • 1. ITB TERM PAPER DATA MINING TECHNIQUES (LINEAR MODELLING AND CLASSIFICATION) RAHUL MAHAJAN (10BM60066)
  • 2. Table of Contents
    INTRODUCTION  3
        ABOUT WEKA  3
        ABOUT R  3
    LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF FUTURE SHARE PRICE  4
        DATA  4
    CASE 1  5
        THE CODE  5
        THE RESULT  5
        INTERPRETATION OF THE RESULT  6
    CASE 2  7
        THE CODE  7
        THE RESULT  7
        INTERPRETATION OF THE RESULT  9
    CLASSIFICATION  10
        THE DATASET  10
        CLASSIFICATION PROCEDURE  10
        INTERPRETING THE RESULTS  11
  • 3. INTRODUCTION
    In this term paper I demonstrate two data mining techniques:
      o LINEAR MODELLING TECHNIQUE, demonstrated using R.
      o CLASSIFICATION, demonstrated using WEKA.
    ABOUT WEKA
    Weka is a Java-based, open-source collection of data mining and machine learning algorithms, including:
      o Pre-processing of data
      o Classification
      o Clustering
      o Association rule extraction
    ABOUT R
    R is an open-source programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software and for data analysis.
  • 4. LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF FUTURE SHARE PRICE
    Here I try to use a GARCH-style model to predict future share prices. GARCH models give us the liberty to define a model using previous share prices and volatility over a defined period. There are many versions of GARCH models, which give better estimates in different scenarios.
    Case 1 - Using previous share prices and standard deviation. In the example explained in this term paper, tomorrow's price is expressed in terms of yesterday's price and the standard deviation of the last 3 days' prices.
    Case 2 - Using the previous day's share price and the previous day's gain. Share prices are generally known to behave on a momentum basis: prices go up for a period of time, and then comes a period when prices go down. This model takes advantage of this behaviour of stock prices.
    Using statistical techniques, I compare the models developed in case 1 and case 2. The expectation is that the model developed using case 2 fits better than the model developed using case 1.
    DATA
    Dr Devlina Chatterjee of VGSoM has purchased a large amount of data from NSE for her research, and I have used a few files from her data. In both cases I have used the February 2008 share price data of Tata Motors. Apart from the traded data, all of the data is available in the public domain. The file contains the following items:
      i) Symbol, ii) Series, iii) Date, iv) Prev Close, v) Open Price, vi) High Price, vii) Low Price, viii) Last Price, ix) Close Price, x) Average Price, xi) Total Traded Quantity, xii) Turnover in Lacs
    This text file is available at this link: http://bit.ly/TM_PVD
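Read off the R code given in the CASE 1 and CASE 2 sections, the two regressions can be written roughly as follows. The symbols are notation introduced here for illustration, not the author's: $p_t$ is the price on day $t$, $s_t$ is the sample standard deviation of a 3-day window, and $g_t$ is the percentage change as the Case 2 code computes it.

```latex
\begin{align}
\text{Case 1:}\quad p_t &= \beta_0 + \beta_1\,p_{t-2} + \beta_2\,s_t + \varepsilon_t,
  & s_t &= \operatorname{sd}\!\left(p_{t-2},\,p_{t-1},\,p_t\right) \\
\text{Case 2:}\quad p_t &= \beta_0 + \beta_1\,p_{t-1} + \beta_2\,g_t + \varepsilon_t,
  & g_t &= 100\,\frac{p_{t-1}-p_t}{p_{t-1}}
\end{align}
```

Both are ordinary least-squares fits with `lm()`; the coefficients $\beta_0, \beta_1, \beta_2$ are what the result tables report as (Intercept) and the two named regressors.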
  • 5. CASE 1
    The program first reads the file and extracts the price data. It creates the vectors A, B and C, which hold the prices of three consecutive days. Then, using a for loop, it finds the standard deviation of each 3-day window of prices. Finally, using linear modelling, it tries to fit a model to predict future prices.
    Before running the code, change the working directory of R to the folder where the text file is saved. The functions used by this code are part of the standard R installation, so there is no need to add any additional packages.
    THE CODE
        TFile <- "tatamotors.txt"
        Trade <- read.table(TFile)
        P <- Trade[, 4]                 # price series (column 4 of the file)
        n <- length(P)
        A <- P[1:(n - 2)]               # first day of each 3-day window
        B <- P[2:(n - 1)]               # second day of each window
        C <- P[3:n]                     # third day of each window (the response)
        l <- length(A)
        D <- numeric(l)                 # standard deviation of each window
        for (i in 1:l) D[i] <- sd(c(A[i], B[i], C[i]))
        summary(lm(C ~ A + D))
    THE RESULT
    The result of the above code can be found on the following page (figure 1).
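To check how the three shifted price vectors line up, the windowing step can be tried on a handful of made-up prices first. This is only a sketch: `lagged_sd_frame` is a hypothetical helper name, and the prices below are invented values, not the Tata Motors data.

```r
# Hypothetical helper: build the Case 1 regressors from a price vector.
lagged_sd_frame <- function(prices) {
  n <- length(prices)
  stopifnot(n >= 3)
  A <- prices[1:(n - 2)]   # first day of each 3-day window
  B <- prices[2:(n - 1)]   # second day of each window
  C <- prices[3:n]         # third day of each window (the response)
  # Sample standard deviation of each row, i.e. of each 3-day window
  D <- apply(cbind(A, B, C), 1, sd)
  data.frame(A = A, B = B, C = C, D = D)
}

prices <- c(700, 710, 705, 720, 715, 730)  # invented values for illustration
case1 <- lagged_sd_frame(prices)
print(case1)

# Same model formula as in the paper, fitted on the toy data
fit <- lm(C ~ A + D, data = case1)
```

Six prices give four complete windows, so `case1` has four rows; with real data the number of rows is two fewer than the number of trading days.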
  • 6. Figure 1 The output of case 1
    INTERPRETATION OF THE RESULT
    The p-values and the F-statistic show that the model is not able to predict the prices well.
                     Estimate   Std. Error  t value  Pr(>|t|)
        (Intercept) -1.891e+03  1.589e+03   -1.190   0.445
        A            3.694e+00  2.257e+00    1.637   0.349
        D           -9.471e-02  1.066e-01   -0.888   0.538
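The p-values and the F-statistic mentioned above can also be pulled out of the summary object rather than read off the printed output. The sketch below uses simulated data (an assumption made purely for illustration), since the original price file is not reproduced here.

```r
set.seed(1)
x <- rnorm(30)
y <- 2 * x + rnorm(30)          # simulated response, not share prices

s <- summary(lm(y ~ x))

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(s)
coef_p <- coef_table[, "Pr(>|t|)"]

# Overall F-test: fstatistic holds the value and both degrees of freedom
f <- s$fstatistic
p_model <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(coef_p)
print(p_model)
```

A small `p_model` means the regressors jointly explain the response better than an intercept-only model; large coefficient p-values, as in the Case 1 table, mean the individual effects cannot be distinguished from zero.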
  • 7. CASE 2
    The program first reads the file. Then it extracts the price data into vector A. Then, using vectors B and C, it finds the gains for the first n-1 days (where n is the total number of days available). This data is stored in vector D. Using the linear model function, one can then find the statistical significance of the model. We know that correlation will be high in this case, so we set the correlation flag of the model summary to TRUE. This makes the summary also print the correlation matrix of the estimated coefficients; it does not change the fitted model itself.
    THE CODE
        TFile <- "tatamotors.txt"
        Trade <- read.table(TFile)
        A <- Trade[, 4]                 # price series (column 4 of the file)
        l <- length(A)
        B <- A[-1]                      # prices of days 2..n (the response)
        C <- A[-l]                      # prices of days 1..n-1 (previous day)
        D <- (C - B) * 100 / C          # percentage change, earlier minus later price
        summary(lm(B ~ C + D), correlation = TRUE)
    THE RESULT
    The result of the above code can be found on the following page (figure 2).
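What the `correlation = TRUE` argument of `summary()` does for an `lm` fit can be seen on a small simulated example (the data below are an assumption for illustration only): the coefficient table is unchanged, and the summary additionally carries the correlation matrix of the estimated coefficients.

```r
set.seed(2)
x1 <- rnorm(40)
x2 <- x1 + rnorm(40, sd = 0.5)   # deliberately correlated with x1
y  <- 1 + x1 + x2 + rnorm(40)

fit <- lm(y ~ x1 + x2)

s_plain <- summary(fit)
s_corr  <- summary(fit, correlation = TRUE)

# The estimates, standard errors and p-values are the same either way
same_table <- isTRUE(all.equal(coef(s_plain), coef(s_corr)))
print(same_table)

# Extra element: correlation matrix of the coefficient estimates
print(s_corr$correlation)
```

Large off-diagonal entries in this matrix indicate that the regressors carry overlapping information, which is worth checking when one regressor is derived from another.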
  • 8. Figure 2 The output of case 2
  • 9. INTERPRETATION OF THE RESULT
                     Estimate   Std. Error  t value   Pr(>|t|)
        (Intercept) -2.335684   2.392506     -0.976   0.333
        C            1.002842   0.003261    307.518   <2e-16 ***
        D           -7.187997   0.042478   -169.215   <2e-16 ***
    Here we see that the significance of the model is very high. The adjusted R-squared is also high. However, a high adjusted R-squared can also reflect autocorrelation in the data, which is very evident in this case. Still, the F-statistic shows that this model predicts share prices better. So here we confirm our assumption that the previous-day gain model (case 2) fits better in R than the standard deviation model (case 1).
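The size of the two slope estimates can be read against the way the Case 2 code defines D. The following is a rough derivation, not part of the original paper; $\bar{C}$ denotes the average of the previous-day prices and is notation introduced here.

```latex
\begin{align}
D_t &= 100\,\frac{C_t - B_t}{C_t}
  \quad\Longrightarrow\quad
  B_t = C_t - \frac{C_t\,D_t}{100} \\
B_t &\approx C_t - \frac{\bar{C}}{100}\,D_t
  \qquad \text{(treating } C_t \approx \bar{C} \text{ in the second term)} \\
\hat\beta_C &\approx 1, \qquad
\hat\beta_D \approx -\frac{\bar{C}}{100}
\end{align}
```

Under this approximation, the reported estimates of about 1.003 for C and about -7.19 for D would correspond to an average price level of roughly 719, and much of the very tight fit follows from D being computed from the same-day price B.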
  • 10. CLASSIFICATION
    Classification is demonstrated here using decision trees. A decision tree is basically an algorithm that creates a rule to determine the output of a new data instance. It creates a tree where each node represents an attribute of our dataset, and a decision is made at each node based on the input. By moving from one node to the next, you reach the end of the tree, which gives a predicted output. This is illustrated using the following example.
    THE DATASET
    The dataset used in this example was found online. The data can be downloaded from the link http://maya.cs.depaul.edu/classes/ect584/weka/data/bank-data.csv. Let's say there is a bank ABC. It has data on 600 people who have either opted for its product or not. It has the following information about these people: age, gender, income, marital status, region and mortgage. The bank can use this information to create a rule that predicts whether a new potential customer would opt for its product, based on the known attributes of the customer.
    CLASSIFICATION PROCEDURE
    Load the data in Weka. To load the data, click on Open file and specify the path. The window shown in figure 3 should appear after loading. Note that there are 12 attributes in the dataset, as seen in the attributes panel of the window. For this example we will use only the following attributes: age, sex, region, income, married, mortgage, savings and product. We will try to predict the response of a new customer (product) using the 7 attributes age, sex, region, income, married, mortgage and savings.
    To remove the remaining attributes, tick the checkbox to the left of each of them and click on Remove. After removing the attributes, one should get the window shown in figure 4.
    Now click on the Classify tab at the top. Under the Classify tab, click on Choose > trees > J48, as shown in figure 5. J48 is Weka's implementation of C4.5, an algorithm for generating decision trees developed by Ross Quinlan as an extension of his earlier ID3 algorithm. The decision trees it generates are built using the concept of information entropy.
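The idea of information entropy behind these trees can be sketched in a few lines of R. This is a simplified illustration, not Weka's implementation: it computes the plain information gain used by ID3, whereas C4.5 normalises this quantity into a gain ratio. The function names and the toy labels are assumptions made for the example.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                       # skip empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain from splitting the labels by a categorical attribute
info_gain <- function(labels, attribute) {
  weights <- table(attribute) / length(attribute)
  within  <- sapply(split(labels, attribute), entropy)
  entropy(labels) - sum(weights[names(within)] * within)
}

# Toy data: response and one candidate attribute (invented values)
product <- c("YES", "YES", "NO", "NO", "YES", "NO")
married <- c("a",   "a",   "a",  "b",  "b",   "b")

h <- entropy(product)
g <- info_gain(product, married)
print(c(entropy = h, gain = g))
```

At each node the tree builder evaluates candidate attributes in this way and splits on the one that reduces uncertainty about the response the most.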
  • 11. Now we can create the model in WEKA. First ensure that the "Use training set" test option is selected, so that only the data we have loaded is used for creating the model. Click Start. The output from this model will look as shown in figure 6.
    INTERPRETING THE RESULTS
    The important results to focus on are:
      1. "Correctly Classified Instances" (75.66 percent) and "Incorrectly Classified Instances" (24.33 percent), which tell us about the accuracy of the model. Our model is neither very good nor very bad; further modification needs to be done.
      2. The confusion matrix, which shows the number of false positives and false negatives. In this case, 117 instances of a are incorrectly classified as b and 29 instances of b are incorrectly classified as a.
      3. The ROC area, which measures the discrimination ability of the forecast. Although there is some discrimination whenever the ROC area is > 0.5, in most situations the discrimination ability of the forecast is not really considered useful in practice unless the ROC area is > 0.7. For our model the ROC area is greater than 0.7 (0.787).
      4. The decision tree, which is the main output. It is the rule that will help to predict the outcome of new data instances. To view the decision tree, right-click on the model and select Visualize tree. You will get the window shown in figure 7.
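The accuracy figures in point 1 follow directly from the misclassification counts in point 2 and the 600 instances in the dataset. A short check in R, using only the numbers quoted above (the variable names are chosen here for illustration):

```r
n_total <- 600   # instances in the bank dataset
a_as_b  <- 117   # instances of class a classified as b
b_as_a  <- 29    # instances of class b classified as a

n_wrong    <- a_as_b + b_as_a
accuracy   <- (n_total - n_wrong) / n_total
error_rate <- n_wrong / n_total

# Percentages, rounded to two decimals
print(round(100 * c(accuracy = accuracy, error = error_rate), 2))
```

The 146 misclassified instances give an error rate of about 24.33 percent and an accuracy of about 75.67 percent, which agrees with the reported figures up to rounding.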
  • 12. Figure 3 Window after loading dataset
    Figure 4 Window after removing unwanted attributes
  • 13. Figure 5 Choosing the J48 tree
    Figure 6 Output of the classification process