DATA MINING CA-02 PAGE 2
Contents
Abstract
Introduction
The CRISP-DM model
1. Business Understanding
1.1 Business Objectives
1.2 Stakeholders for this project
1.3 Benefits
1.4 Business constraints
2. Data understanding
2.1 About the dataset
2.2 EDA
3. Data preparation
3.1 Replace missing value
3.2 Feature selection
3.3 Data splitting
4. Modeling
4.1 Auto model
5. Evaluation
6. Deployment
6.1 Business outcomes
Conclusion
IMAGES
[Reference model]
[Attribute of dataset]
[data types & Column Name]
[Missing value]
[Select attribute]
[data splitting]
[Model]
[Auto model]
[Gradient boosted tree]
[Evaluation]
[Evaluation python]
[Result of prediction]
Buying or building the perfect house is a lifetime goal for many people. However, many buyers make mistakes when purchasing a property, and some housing agents give their clients wrong information about the price. As a result, many people end up paying a high price for a property that is worth far less.
The aim of this project is to predict housing prices and to draw insights about the price distribution in the California housing data set. By the end of the project we will know how the price range depends on various factors. The project can help both clients and real estate agents judge the right price.
PROJECT REPORT
Abstract:
A place to stay is a basic need of every individual, human or animal. However, many people are homeless because they cannot afford a house. Take California, known as the land of golden dreams: it has become a housing nightmare for many people, with a median house price of about $600,000 for two people. Likewise, Ireland's housing crisis is at its worst in Dublin, a leading tech hub of the country. As house prices rise dramatically, the crisis is felt around the world. What is behind this issue? Many people criticize governments for failing to keep house prices in check. We instead propose a process to study the problem and predict the price range distribution, which can help both customers and real estate agents. We follow the CRISP-DM model and use the RapidMiner tool to predict the price distribution.
Introduction:
As the world's population grows, more people are looking to buy a new house within their budget, and real estate agents need a better market strategy. Because house prices increase dramatically every year, there should be a system that predicts the price of a new house from what buyers ask for, such as house size, number of bedrooms and location. Such a system could help real estate agents set house prices for their clients. Several methods have been proposed to determine the house price range.
Traditional house price prediction compared the cost price of a house with its selling price. This method has gradually fallen short of business standards. Today several methodologies are used to predict prices, and data mining has proven to be a reliable one for this kind of project. We adopt the CRISP-DM methodology, whose steps are designed to match business needs and can lead to a well-founded prediction model.
THE CRISP-DM Model:
• CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is a standard process model that divides a data mining project into six phases. It is a well-proven methodology that can lead to better results for our model.
• We apply it in practice in this project, which goes through all six phases. The six phases of the CRISP-DM cycle are presented below for reference.
Fig.1[Six Phases Of CRISP-DM]
[Fig.2 Reference Model Description]
1. Business Understanding:
This section describes the business problems and ideas behind this project. This step helps us understand the requirements so that we can analyze the data with the right tools and techniques.
1.1 Business Objective:
The business objectives of this project are described below:
• California is one of the fastest-growing technology hubs in America, so many people want to live there for career reasons. We therefore determine the price distribution across the areas of California and explore how income is distributed among residents.
• The second objective is to determine which factors affect the price distribution. For example, a new house may be in high demand and command a much higher price, so we explore whether house age or some other attribute affects the price.
• Location is always an important consideration for buyers. With our proposed model we examine the impact of location on price. Finally, we predict the price range based on the important attributes.
1.2 Stakeholders for this project:
Stakeholders are the business leaders or individuals who are affected by the business objectives. By the end of this project, important stakeholders such as real estate agents can improve their market strategy with clients for selling new properties. Likewise, clients are the second group of stakeholders, who can get a better picture of the price distribution in different cities. Finally, government bodies are an important stakeholder who can benefit from this work: they can monitor the distribution of house values to tackle housing crises.
1.3 Benefits:
This section describes the benefits that may be gained by implementing this project.
• Real estate agents can get a realistic price distribution, which can help their business strategy.
• Clients can get a fair house value, which protects them from fraudulent agents asking for an inflated price.
• Government can curb corruption in the housing sector.
• Poorer people can get the right value for a property, which can help reduce homelessness in the country.
1.4 Business Constraints:
The process is not generalized, which is a business constraint. We have used an old dataset whose prices may not match present house values, and the price distribution is described by only a limited number of attributes. It can therefore be difficult to apply this model directly in further business use. We can resolve this problem by using new data with current demand factors.
2. Data Understanding:
Data understanding is an important phase in every data mining project. Understanding the data better helps us choose the right model for the machine learning project and gives us a clear picture of the price distribution and its fluctuation. It also helps non-technical business leaders understand the aspects of the project.
2.1 About the dataset:
The data comes from the 1990 California census and was obtained from Kaggle. It has 10 attributes (columns) related to housing, such as house age, population, location and number of bedrooms. These attributes are used to predict the price range distribution.
Data source: https://www.kaggle.com/camnugent/california-housing-prices/download
Fig[Attribute description]
2.2 Exploratory data Analysis:
This step visualizes the data, such as the data types we have and the important columns to keep for the project. It gives an idea of the whole data set, which helps in the later data preparation steps. We used a Python Jupyter notebook for a simple visualization of the data set.
Fig 3[data types] Fig 4[column Name]
From the above picture we can see that the data set has 20,640 rows and 10 attributes. All attributes are numeric except the location attribute, ocean proximity, which is categorical. Furthermore, the house value is continuous, so we can apply a regression algorithm. We explore the data further below to determine the next steps.
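The inspection described above can be sketched with the Python standard library alone. This is a minimal illustration, not the notebook used in the project: the three sample rows are toy values, the column names are assumed to match the Kaggle file, and `summarize` and `is_number` are hypothetical helper names.

```python
import csv
import io

# Toy sample; column names assumed to match the Kaggle file, values illustrative.
SAMPLE_CSV = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-121.50,38.57,20.0,1500.0,,900.0,400.0,3.2000,150000.0,INLAND
-117.20,32.75,35.0,2100.0,450.0,1300.0,420.0,4.1000,250000.0,NEAR OCEAN
"""


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize(csv_text):
    """Count the rows and label each column as numeric or categorical."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    kinds = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows if row[column] != ""]  # skip missing cells
        kinds[column] = "numeric" if all(is_number(v) for v in values) else "categorical"
    return len(rows), kinds


n_rows, kinds = summarize(SAMPLE_CSV)
print(n_rows, "rows,", len(kinds), "columns")
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

On the full file the same check would be expected to flag only ocean proximity as categorical, which is what motivates treating the task as regression on numeric inputs.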
Fig[House Value VS population]
This scatter plot shows the distribution of house value against population. We can see that only a few house values are very high relative to the population.
Fig[total bedroom VS Median income]
The above figure shows how median income is distributed against total bedrooms. It suggests that people with higher income tend to want more bedrooms. However, only a few points with very high income appear in the plot.
Fig 5[ocean proximity]
From the above picture we can see that the ocean proximity column contains only categorical values, which we must convert to numerical values before modeling.
3. Data Preparation:
Data preparation is the third phase of a data mining project. In this step we prepare the data for the model. It involves cleaning out unusable columns and values that could affect the predicted attribute. The general data preparation steps are described below.
Clean the missing values:
In this step we clean or fill in the missing values in our dataset to make it more reliable for prediction.
Changing the data types:
If the dataset contains mixed data types, this can cause problems later in the process. In this step we change all categorical values to numerical ones so that the model can use them.
Remove all duplicate values:
If the data used for the prediction model contains duplicate records, this could lead to prediction bias, so we must remove all duplicates from the data set.
Data normalization:
Data normalization is an important step in mining projects to make predictions reliable, but we judged that the numerical values in our data did not require it, so we do not normalize the data for our model.
The data preparation steps applied to our data set are described below.
3.1 Replace Missing values:
We checked for missing values in our data set in both Python and RapidMiner. The outputs are shown below.
Fig[Missing values]
The above picture shows the missing values found in Python. There are 207 missing values in total, all in the total bedrooms column. We replace these missing values with the mean value of that column.
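Mean replacement of this kind can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: missing cells are represented as `None`, the values are toy numbers, and `fill_with_mean` is a hypothetical helper rather than the code from the project notebook.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    column_mean = mean(observed)
    return [column_mean if v is None else v for v in values]


# Toy total-bedrooms column with two missing cells (illustrative values).
total_bedrooms = [129.0, None, 190.0, 235.0, None, 280.0]
filled = fill_with_mean(total_bedrooms)
print(filled)
```

The mean is computed from the observed values only, so the filled column keeps the same average as before. A median would be a reasonable alternative if the column is strongly skewed.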
Fig[missing value Rapidminer]
We also used RapidMiner to replace the missing values in our data set, filling the missing total bedrooms values with the average of that column. To do this, we choose the Replace Missing Values operator from the RapidMiner operator box, select the attribute we want to clean, and set the replacement mode to average.
After this process we converted the categorical variable in our dataset to numeric. As described previously, the ocean proximity column contains categorical values. We use Python's label encoder to change its five categories to numbers. With the transformation complete, the next step is feature selection.
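The effect of label encoding can be sketched without any library. The mapping below assigns integers in sorted order of the category names, which is also how scikit-learn's LabelEncoder orders its classes. The five category strings are assumed to be those of the Kaggle file, and `fit_label_mapping` is a hypothetical helper.

```python
def fit_label_mapping(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


# Toy column covering the five assumed ocean proximity categories.
ocean_proximity = ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR OCEAN", "ISLAND", "INLAND"]
mapping = fit_label_mapping(ocean_proximity)
encoded = [mapping[value] for value in ocean_proximity]
print(mapping)
print(encoded)
```

Integer codes impose an arbitrary order on the categories, which a linear model will treat as meaningful. One-hot encoding is a common alternative when that is a concern.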
3.2 Feature selection:
Feature selection is an important step in mining projects. We choose the important features from the dataset to increase prediction accuracy. Feature selection can be done by various methods, such as filter methods and wrapper methods. In this project we used a filter method based on the correlation between the attributes. The feature selection plots are shown below.
Fig[bivariate, univariate]
The picture above shows bivariate and univariate plots, which describe the relations among the attributes and help in choosing the right attributes for modeling. We also used a correlation plot for better visualization, shown below.
Fig [co-relation matrix]
After plotting the correlation matrix we can see that the first two columns, longitude and latitude, have either a high or a very low correlation with every other attribute. Apart from these two, the attributes are moderately correlated with each other. The median house value column shows a high correlation with the other attributes, and it is the quantity we want to predict, so we choose it as the label column. Next we describe the steps used in RapidMiner for feature selection.
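A correlation-based filter ranks each candidate attribute by the absolute value of its Pearson correlation with the label. The sketch below illustrates this with the standard library only. The three columns hold toy values chosen for illustration, and `pearson` is a hypothetical helper, not the plotting code behind the figure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy columns; values are illustrative only.
columns = {
    "median_income": [2.0, 3.0, 4.0, 5.0, 6.0],
    "housing_median_age": [40.0, 15.0, 30.0, 22.0, 35.0],
    "median_house_value": [120000.0, 160000.0, 210000.0, 240000.0, 300000.0],
}
target = "median_house_value"

# Correlation of every candidate attribute with the label.
scores = {
    name: pearson(values, columns[target])
    for name, values in columns.items()
    if name != target
}
# Filter method: rank attributes by absolute correlation with the label.
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:+.3f}")
```

Pearson correlation captures only linear relationships, so an attribute with a low score may still be useful to a tree-based model.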
Steps in RapidMiner:
We used the Select Attributes operator in RapidMiner and chose the attributes to keep for the rest of the process according to the correlation plot above. The operator and the selected attributes are shown below.
Fig[select attribute rapidminer]
3.3 Data Splitting:
In this step we split the final dataset into training and test sets, in both RapidMiner and Python. As shown below, we split the data in a 70%/30% proportion in RapidMiner and in an 80%/20% proportion in Python.
Fig[data split python]
Fig[data split in rapid miner]
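The two split proportions can be illustrated with a small shuffled split. This is a standard-library sketch, not the splitter used in the notebook or in RapidMiner: `split_rows` and its `seed` parameter are hypothetical, and the 100 integers stand in for dataset rows.

```python
import random


def split_rows(rows, test_fraction, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 dataset rows

train_80, test_20 = split_rows(rows, test_fraction=0.20)  # proportion used in Python
train_70, test_30 = split_rows(rows, test_fraction=0.30)  # proportion used in RapidMiner
print(len(train_80), len(test_20))
print(len(train_70), len(test_30))
```

Because the two tools use different proportions and different random splits, their error figures are measured on different test rows and are not strictly comparable.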
4. Modeling:
In the modeling phase we implemented linear regression. As mentioned previously, the median house value is continuous. Linear regression predicts a target variable y from independent variables x; in this dataset the target variable y is the median house value and the remaining selected attributes are the independent variables x. We built linear regression models in both RapidMiner and Python. In addition, RapidMiner's Auto Model showed that the gradient boosted tree had the highest accuracy with the minimum run time, so we chose the gradient boosted tree in Auto Model. The process is shown below.
Fig[model python]
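The idea behind the regression can be sketched with a single predictor, a deliberate simplification of the multi-attribute model described above. The code fits y = intercept + slope * x by ordinary least squares. The data are toy values, and `fit_line` and `predict` are hypothetical helpers, not the code shown in the figure.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Apply the fitted line to new predictor values."""
    return [intercept + slope * a for a in x]


# Toy data lying exactly on a line; values are illustrative only.
median_income = [2.0, 3.0, 4.0, 5.0, 6.0]
median_house_value = [100000.0, 150000.0, 200000.0, 250000.0, 300000.0]

intercept, slope = fit_line(median_income, median_house_value)
print(intercept, slope)
print(predict(intercept, slope, [4.5]))
```

With several predictors the same least-squares principle applies, but the coefficients are found by solving a linear system, which a library regression routine does internally.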
Fig[model Rapidminer]
The picture above shows how we built the model in RapidMiner. The operators used to build the model are described below.
First, we loaded the dataset into RapidMiner using the Retrieve operator and inspected the missing values with the data statistics view. We then used the Replace Missing Values operator to replace each missing value with the column average. After this, we used the Correlation Matrix operator to show the correlation coefficients between the attributes. We then set the role of the median house value attribute to label (the target column) and split the data with the Split Data operator. Lastly, we applied the Linear Regression operator to fit the model to the data.
4.1 Auto model:
Auto Model tried various models. Of these, the gradient boosted tree gave reliable accuracy with a short run time, so we selected it. The result of the GBT is shown below.
Fig[Auto model]
Fig[Gradient Boosted Tree]
In the picture above we can see that the gradient boosted tree takes total rooms as the root node, with population and total bedrooms as sub-branches of the tree used to predict the price.
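To show what a gradient boosted tree does in principle, the sketch below boosts one-split trees (stumps) on a single attribute under squared loss: each new stump is fitted to the residuals of the current prediction and added with a small learning rate. It is a simplified illustration and not RapidMiner's implementation. The data are toy values, and all function names and parameter defaults are assumptions.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # every split leaves both sides non-empty
        left = [r for a, r in zip(x, residuals) if a <= threshold]
        right = [r for a, r in zip(x, residuals) if a > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosted(x, y, n_rounds=50, learning_rate=0.1):
    """Boost stumps on squared loss; each stump fits the current residuals."""
    base = sum(y) / len(y)  # start from the mean of the target
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [target - p for target, p in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if a <= threshold else right_mean)
            for a, p in zip(x, predictions)
        ]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    """Add the shrunken contribution of every stump to the base value."""
    base, learning_rate, stumps = model
    return [
        base + learning_rate * sum(left if a <= t else right for t, left, right in stumps)
        for a in x
    ]


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Toy data; values are illustrative only.
total_rooms = [500.0, 1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
house_value = [100000.0, 120000.0, 180000.0, 200000.0, 260000.0, 300000.0]

model = fit_boosted(total_rooms, house_value)
baseline = [sum(house_value) / len(house_value)] * len(house_value)
print(mse(house_value, baseline))
print(mse(house_value, predict_boosted(model, total_rooms)))
```

The second printed error is measured on the training data, so it shows only that boosting fits the sample better than the mean. Accuracy on unseen rows has to be checked on the test set.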
5. Evaluation:
In the evaluation phase we evaluated both our Python and RapidMiner models using the root mean square error (RMSE). Comparing the two, the RapidMiner model has the higher RMSE, 90147.99. Both results are shown below.
Fig[Evaluation RapidMiner]
Fig[Evaluation python]
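The root mean square error used in this comparison is the square root of the average squared difference between actual and predicted values. A minimal sketch with toy numbers follows. `rmse` is a hypothetical helper, and the values are not taken from the project results.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


actual = [200000.0, 150000.0, 320000.0]     # toy house values
predicted = [210000.0, 140000.0, 300000.0]  # toy predictions
print(round(rmse(actual, predicted), 2))
```

RMSE is expressed in the same unit as the target, so a value of about 90,000 means the predictions are typically off by that many dollars. Large errors are weighted more heavily because they are squared before averaging.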
Finally, we found that some attributes affected our model accuracy. We did not select total bedrooms and income in RapidMiner, and that model is less accurate than the Python model. This suggests that income and bedroom count affect the house value prediction.
6. Deployment:
Fig[prediction result]
After the evaluation phase we deployed our model to compare the predicted price with the original price and to find insights in the predictions that can help tackle the business problems. Here we can clearly see that there is a large difference between the predicted and original prices. Total bedrooms and population strongly affect our predictions.
6.1 Business outcomes:
With respect to the business objectives discussed previously, the outcomes are as follows.
Total bedrooms and population have a strong effect on the price difference. However, we did not use the income attribute.
At first we thought that the location of a house would affect the house price fluctuation, but from the result we conclude that it has very little effect on the price distribution.
Likewise, the age of the property has a negative effect on the price distribution. Some old houses have a higher price range than new houses, although the result differs from case to case. We also see that population affects the house price when it is combined with house age.
Conclusion:
We have addressed the business problems with our model and implemented all the steps of the CRISP-DM model. However, it was difficult to increase the accuracy of the model because there are outliers in our data. We also did not use the income attribute in our RapidMiner model. We think it is an important attribute, and including it in further work could increase the accuracy of our model.
The Manufacturer 2015
 
Descriptive And Narrative Essay.pdf
Descriptive And Narrative Essay.pdfDescriptive And Narrative Essay.pdf
Descriptive And Narrative Essay.pdf
 
Descriptive And Narrative Essay. Example Of Narrative Essay About Experience ...
Descriptive And Narrative Essay. Example Of Narrative Essay About Experience ...Descriptive And Narrative Essay. Example Of Narrative Essay About Experience ...
Descriptive And Narrative Essay. Example Of Narrative Essay About Experience ...
 
CBRE Integrated Supply Chain & Real Estate Solutions
CBRE Integrated Supply Chain & Real Estate SolutionsCBRE Integrated Supply Chain & Real Estate Solutions
CBRE Integrated Supply Chain & Real Estate Solutions
 
SolarCity Plans Book
SolarCity Plans BookSolarCity Plans Book
SolarCity Plans Book
 
SolarCity Plans Book
SolarCity Plans BookSolarCity Plans Book
SolarCity Plans Book
 
Rehearsal Script Page 1 Introduction Lets get down t.docx
Rehearsal Script Page 1  Introduction Lets get down t.docxRehearsal Script Page 1  Introduction Lets get down t.docx
Rehearsal Script Page 1 Introduction Lets get down t.docx
 
Fresh Tek Business Plan
Fresh Tek Business PlanFresh Tek Business Plan
Fresh Tek Business Plan
 
UMD Student to Business Initiative
UMD Student to Business Initiative UMD Student to Business Initiative
UMD Student to Business Initiative
 
Real Estate's Big Data Revolution: The New Way to Create Value
Real Estate's Big Data Revolution: The New Way to Create ValueReal Estate's Big Data Revolution: The New Way to Create Value
Real Estate's Big Data Revolution: The New Way to Create Value
 
Global Business Environment Project
Global Business Environment ProjectGlobal Business Environment Project
Global Business Environment Project
 

Recently uploaded

Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...gajnagarg
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...HyderabadDolls
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...kumargunjan9515
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...vershagrag
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...HyderabadDolls
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...HyderabadDolls
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime GiridihGiridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridihmeghakumariji156
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdfkhraisr
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...gajnagarg
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 

Recently uploaded (20)

Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime GiridihGiridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 

Prediction of housing price

  • 1. DATA MINING CA-02 PAGE 2
    Contents
    Abstract
    Introduction
    The CRISP-DM model
    1 Business understanding
    1.1 Business objectives
    1.2 Stakeholders for this project
    1.3 Benefits
    1.4 Business constraints
    2 Data understanding
    2.1 About the dataset
    2.2 EDA
    3 Data preparation
    3.1 Replace missing value
    3.2 Feature selection
    3.3 Data splitting
    4 Modeling
    4.1 Auto model
    5 Evaluation
    6 Deployment
    6.1 Business outcomes
    Conclusion
    Images
    [Reference model]
    [Attribute of dataset]
    [Data types & column name]
    [Missing value]
    [Select attribute]
    [Data splitting]
    [Model]
    [Auto model]
    [Gradient boosted tree]
    [Evaluation]
    [Evaluation python]
    [Result of prediction]
  • 2. Buying or building the perfect house is a lifetime goal for many people. However, many buyers make mistakes when purchasing a property, and some housing agents give their clients wrong information about the price. As a result, many people end up paying a high price for a property that is worth much less. The aim of this project is to predict housing prices and to draw insights about the price distribution in the California housing dataset. By the end of the project we will know how the price range is distributed and which factors it depends on. The project can help both clients and real estate agents judge price fluctuations correctly.
  • 3. PROJECT REPORT
    Abstract: A place to stay is a basic need of every individual, human or animal. Nevertheless, many people are homeless because they cannot afford a suitable house. Take California, the American state known as the land of golden dreams: it has become people's worst housing nightmare, with a median house price of about $600,000 for two people. Likewise, Ireland's housing crisis is at its worst in Dublin, the country's leading tech hub. With house prices rising dramatically, the crisis is felt around the world. What is behind this problem? Many people now criticize governments for failing to keep house prices in check. We instead propose a process to tackle the problem by predicting the price range distribution, which can help both customers and real estate agents. We use the CRISP-DM model and the RapidMiner tool to predict the price distribution.
    Introduction: As the world's population grows, more people look for a new house that fits their budget. Buyers tend to be conservative, and housing agents need a better market strategy. Because house prices rise dramatically every year, there should be a system that predicts the price of a new house from what people demand in terms of house size, number of bedrooms and location. Such a system could help real estate agents set house prices for their clients. Several methods have been proposed to determine the house price range. Traditional house price prediction compared the cost price and the selling price of a house, but this method has gradually failed to meet business standards. Today several methodologies are used to predict prices, and data mining has proved to be a reliable one for this kind of project. We have adopted the CRISP-DM methodology, whose steps are designed to match business needs and which can help produce an accurate prediction model.
  • 4. THE CRISP-DM Model:
    • CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is a standard process that divides a data mining project into six phases, and it is a well-proven methodology that can give better results for our model.
    • We implement it here to demonstrate the method in practice, and all six phases are covered in our project. The six phases of the CRISP-DM cycle are presented below for reference.
    Fig.1[Six Phases Of CRISP-DM]
  • 5. [Fig.2 Reference Model Description]
    1. Business Understanding: This section describes the business problems and ideas behind the project. This step helps us understand the requirements so that the data can be analyzed with the right tools and techniques.
    1.1 Business Objective: The business objectives we pursue in this project are described below:
    • California is one of the fastest-growing technology hubs in America, so many people want to live there for career reasons. We therefore determine the price distribution in each area of California and explore how income is distributed among residents.
    • The second objective is to determine which factors affect the price distribution. For example, a new house can be in high demand and command a much higher price, so we explore whether house age, or some other factor, affects the price.
    • Location always matters to a buyer. With our proposed model we can measure the impact of location on price. Finally, we predict the new price range from the important attributes.
    1.2 Stakeholders for this project: Stakeholders are the business leaders or individuals who are affected by the business objectives. By the end of this project, important stakeholders such as real estate agents can improve their market
  • 6. strategy with clients for selling new properties. Likewise, clients are the second group of stakeholders, who get a better picture of the price distribution in different cities. Finally, government bodies are important stakeholders who can gain a marginal benefit from addressing this business problem: they can protect the distribution of house values in order to tackle housing crises.
    1.3 Benefits: This section describes the benefits that may be gained by implementing this project.
    • Real estate agents get a realistic price distribution that can support their business strategy.
    • Clients get a fair house value, so they can protect themselves from fraudulent agents asking for an inflated price.
    • Government can curb corruption in the housing sector.
    • Poor and homeless people can get the right value for a property, which can reduce homelessness in the country.
    1.4 Business Constraints: The process is not generalized, which is a business constraint. We used an old dataset whose prices may not match present house values, and the price distribution is modeled with a limited number of attributes. It may therefore be difficult to rely on this model for further business implementation. The problem can be resolved by using new data that reflects current demand factors.
    2. DATA UNDERSTANDING: Data understanding is an important part of every data mining project. Understanding the data better helps in choosing the right model for the machine learning task. It also gives a clear picture of the price distribution and its fluctuation, which helps non-technical business leaders understand the project.
    2.1 About the dataset: The data contains information from the 1990 California census and was obtained from Kaggle. It has 10 attributes (columns) related to housing, such as house age, population, location and number of bedrooms. These attributes are used to predict the price range distribution.
    Data source: https://www.kaggle.com/camnugent/california-housing-prices/download
  • 7. Fig[Attribute description]
    2.2 Exploratory Data Analysis: This step visualizes the data, for example the data types present and the columns worth keeping for the project. It gives an overview of the whole dataset, which helps in the later data preparation steps. We used a Python Jupyter notebook for a simple visualization of the dataset.
    Fig 3[data types] Fig 4[column Name]
    From the above pictures we can see that the dataset has 20,640 rows and 10 attributes. All attributes are numeric except the location type, ocean proximity, which is a categorical variable. Furthermore, the house value is continuous, so a regression algorithm can be applied. We explore the data further below to determine the next steps.
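The inspection described above can be sketched in pandas. This is only an illustration: the three rows are made-up values standing in for the Kaggle file, and only the column names follow the California housing dataset.

```python
import pandas as pd

# Synthetic stand-in for the Kaggle CSV: the values are invented for
# illustration, only the ten column names mirror the real dataset.
df = pd.DataFrame({
    "longitude": [-122.23, -122.22, -118.30],
    "latitude": [37.88, 37.86, 34.05],
    "housing_median_age": [41.0, 21.0, 52.0],
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, 1106.0, float("nan")],
    "population": [322.0, 2401.0, 496.0],
    "households": [126.0, 1138.0, 177.0],
    "median_income": [8.3252, 8.3014, 7.2574],
    "median_house_value": [452600.0, 358500.0, 352100.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "<1H OCEAN"],
})

# Number of rows (records) and columns (attributes).
n_rows, n_cols = df.shape

# Separate numeric attributes from the non-numeric (categorical) ones.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print(categorical_cols)
```

On the real file the same calls would presumably report the 20,640 rows and 10 attributes mentioned above, with ocean proximity as the only non-numeric column.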
  • 8. Fig[House Value VS population]
    This scatter plot shows how house value is distributed against population. Only a few house values are very high relative to the population.
    Fig[total bedroom VS Median income]
    The above figure shows how median income is distributed against the number of bedrooms. It suggests that the higher the income, the more bedrooms people want. However, only a few observations have a very high income.
  • 9. Fig 5[ocean proximity]
    From the above picture we can see that the ocean proximity column contains only categorical values, which must be converted to numerical values before modeling.
    3. DATA PREPARATION: Data preparation is the third step in a data mining project and one of the most important. In this step we prepare the data for the model by cleaning out unusable columns and values that could affect the predicted attribute. The data preparation steps are described below.
    Clean the missing values: We clean or fill in the missing values in the dataset to make it more reliable for prediction.
    Changing the data types: Mixed data types in a dataset can cause problems in later steps, so we convert all categorical values to numerical ones to make prediction simpler for the model.
    Remove all duplicate values: Duplicate records in the data can lead to prediction bias, so they must be removed from the dataset.
    Data normalization: Data normalization is an important step for making predictions reliable, but we judged that the numerical values in our data did not call for it, so we do not normalize in our model.
    The data preparation steps applied to our dataset are described below.
  • 10. 3.1 Replace Missing values: We checked for missing values in the dataset in both Python and RapidMiner. The outputs are shown below.
    Fig[Missing values]
    The above picture shows the missing values found in Python. There are 207 missing values in total, all in the total bedrooms column. We replace these missing values with the mean of that column.
    Fig[missing value Rapidminer]
    We also used RapidMiner to replace the missing values: the missing total bedrooms values are replaced with the column average. To do this, we pick the Replace Missing Values operator from the RapidMiner operator box, select the attribute we want to clean, and set the replacement mode to average.
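A minimal pandas sketch of the mean replacement described above, assuming the column is named `total_bedrooms` as in the Kaggle file. The five values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column with two gaps; the numbers are illustrative only.
df = pd.DataFrame(
    {"total_bedrooms": [129.0, float("nan"), 190.0, 235.0, float("nan")]}
)

# Count the gaps before imputation.
missing_before = int(df["total_bedrooms"].isna().sum())

# Series.mean() skips NaN by default, so this is the mean of the known values.
col_mean = df["total_bedrooms"].mean()

# Replace every missing entry with that mean.
df["total_bedrooms"] = df["total_bedrooms"].fillna(col_mean)

missing_after = int(df["total_bedrooms"].isna().sum())
print(missing_before, missing_after)
```

Mean imputation keeps the column average unchanged but shrinks its variance slightly, which is usually acceptable when, as here, only a small share of the rows is affected.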
  • 11. After this process we converted the categorical variable in the dataset to numeric. As described earlier, the ocean proximity column contains categorical values. We used Python's label encoder to convert its 5 categories to numbers. With the transformation complete, the next step is feature selection, described below.
    3.2 Feature selection: Feature selection is an important step in data mining projects. We choose the important features from the dataset to increase prediction accuracy. Feature selection can be done with various approaches, such as filter methods and wrapper methods. In this project we implemented a filter method based on the correlation between attributes. Feature selection is illustrated below.
    Fig[bivariate, univariate]
    The above picture is a bivariate and univariate plot that shows the relationships among attributes, which helps in choosing the right attributes for modeling. We also used a correlation plot for better visualization, shown below.
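The label-encoding step can be sketched as follows, assuming that the "label encoder" mentioned above is scikit-learn's `LabelEncoder` (the report does not name the library). The six sample rows are invented; the five category strings are the ones used in the Kaggle file.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows covering the five ocean_proximity categories.
df = pd.DataFrame({
    "ocean_proximity": [
        "NEAR BAY", "INLAND", "<1H OCEAN", "NEAR OCEAN", "ISLAND", "INLAND",
    ]
})

encoder = LabelEncoder()

# fit_transform sorts the distinct labels and maps each one to 0..n-1.
df["ocean_proximity_code"] = encoder.fit_transform(df["ocean_proximity"].tolist())

# Mapping from category name to integer code, for reference.
code_of = {str(label): code for code, label in enumerate(encoder.classes_)}
print(code_of)
```

The integer codes follow alphabetical order, not distance to the ocean, so a linear model will read an ordering into them that has no physical meaning. One-hot encoding is a common alternative when that matters.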
  • 12. Fig [co-relation matrix] In the correlation matrix, the first two columns, longitude and latitude, have either very high or very low correlation with every other attribute. Apart from these two, the attributes are moderately correlated with each other. The median house value column shows a high correlation with every attribute, so we chose it as the label column for predicting the price range. Next, we describe the steps used in RapidMiner for feature selection.
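A correlation-based filter of the kind described here can be sketched with pandas. The toy values, the column names, and the 0.5 cut-off are all assumptions chosen for illustration; the report does not state a threshold.

```python
import pandas as pd

# Toy data with hypothetical column names; values are illustrative only.
df = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "housing_median_age": [5.0, 3.0, 4.0, 1.0, 2.0],
    "population": [2.0, 4.0, 3.0, 4.0, 2.0],
    "median_house_value": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Pearson correlation matrix between all numeric columns.
corr_matrix = df.corr()

# Correlation of each feature with the label column.
corr_with_target = corr_matrix["median_house_value"].drop("median_house_value")

# Filter method: keep features whose absolute correlation with the
# label reaches an assumed threshold.
THRESHOLD = 0.5
selected = corr_with_target[corr_with_target.abs() >= THRESHOLD].index.tolist()
```

Using the absolute value keeps strongly negative correlations as well, since they are as informative for a linear model as positive ones.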
  • 13. Steps in RapidMiner: We used the Select Attributes operator in RapidMiner and chose the attributes to keep for the rest of the process according to the correlation plot above. The operator and the attributes are shown below. Fig[select attribute rapidminer] 3.3 Data splitting: In this step we split the final dataset into training and test sets. We did this in both RapidMiner and Python: in RapidMiner we split the data 70% / 30%, whereas in Python we split it 80% / 20%. Fig[data split python]
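The 80% / 20% split in Python could be done with scikit-learn's `train_test_split`, as in this sketch. The arrays are synthetic placeholders, and `random_state=42` is an assumed seed used only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 10 rows, 2 features, and a target vector.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; the remaining 80% are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Because the two tools used different split ratios (and different random partitions), their error figures are not computed on the same test rows, which is worth keeping in mind when comparing them.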
  • 14. Fig[data split in rapid miner] 4. Modeling: In the modeling phase we implemented linear regression. As mentioned previously, median house value is a continuous variable, so it is our target variable y, which the model predicts from the other, independent variables, which we set as x. We built linear regression models in both RapidMiner and Python. In RapidMiner's Auto Model, however, we found that the gradient boosted tree had the highest accuracy with the minimum run time, so we finally chose the gradient boosted tree in Auto Model. The whole process is shown below. Fig[model python]
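As a rough sketch of the Python side of this step, the snippet below fits scikit-learn's `LinearRegression` on a tiny synthetic dataset. The data are constructed so that y = 5 + 2·x1 + 3·x2 exactly; they are an assumption for illustration and have nothing to do with the housing values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic independent variables (x) and a target (y) built from a known
# linear rule, so the fitted coefficients can be checked by eye.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y = 5.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Ordinary least squares fit of y on the columns of X.
model = LinearRegression()
model.fit(X, y)

# Predict the target for a new, unseen row.
prediction = model.predict(np.array([[6.0, 4.0]]))
```

In the real project, X would hold the selected attributes and y the median house value, fitted on the training split only.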
  • 15. Fig[model Rapidminer] The picture above shows how we built the model in RapidMiner. The operators used are described below. First, we loaded the dataset with the Retrieve operator. Next, we inspected the missing values using the data statistics view and set the Replace Missing Values operator to replace each missing value with the attribute's average. We then used the Correlation Matrix operator to show the correlation coefficients between attributes. After that, we set the role of the median house value attribute to label (the target column) and split the data with the Split Data operator. Lastly, we applied the Linear Regression operator to fit the data. 4.1 Auto Model: Auto Model evaluated various models; of these, the gradient boosted tree gave reliable accuracy with a short run time, so we selected it. The result of the GBT is shown below.
  • 16. Fig[Auto model] Fig[Gradient Boosted Tree] In the picture above, the gradient boosted tree takes total rooms as the root node, with population and total bedrooms as sub-branches of the tree used to predict the price.
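The report built its gradient boosted tree with RapidMiner's Auto Model. For readers working in Python, a loosely comparable model can be sketched with scikit-learn's `GradientBoostingRegressor`; this is a different implementation, not the one used in the report. The data are synthetic, and the made-up rule that the first column dominates the target is an assumption that lets the feature importances be inspected.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data; the column names are hypothetical labels for the 3 features.
feature_names = ["total_rooms", "population", "total_bedrooms"]
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 3))

# Made-up target: driven almost entirely by the first feature, plus noise.
y = 10.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

# Boosted ensemble of shallow regression trees (library defaults otherwise).
model = GradientBoostingRegressor(random_state=0)
model.fit(X, y)

# Impurity-based importances, normalised to sum to 1 across features.
importances = dict(zip(feature_names, model.feature_importances_))
train_r2 = model.score(X, y)
```

Feature importances play a role similar to reading the root and upper branches of a single tree: they indicate which attributes the ensemble relies on most.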
  • 17. 5. Evaluation: In the evaluation phase we evaluated both the Python and the RapidMiner models using the root mean square error (RMSE). Comparing the two, the RapidMiner model has the higher RMSE, 90147.99. Both results are shown below. Fig[Evaluation RapidMiner] Fig[Evaluation python] Finally, we found that some attributes affected model accuracy. We did not select total bedrooms and income in RapidMiner, and that model is less accurate than the Python model. This suggests that income and the number of bedrooms influence house value prediction.
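The root mean square error used here is the square root of the mean of the squared differences between actual and predicted values. A minimal NumPy sketch follows; the function name `rmse` and the three sample values are illustrative assumptions, not results from the report.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values: the errors are 1, 0 and -2, so the mean squared
# error is 5/3 and the RMSE is its square root.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
error = rmse(actual, predicted)
```

RMSE is expressed in the same unit as the target, so a value such as 90147.99 reads directly as a typical error in house value.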
  • 18. 6. Deployment: Fig[prediction result] After the evaluation phase we deployed the model to compare the predicted prices with the original prices and to draw insights from the predictions that can help tackle the business problems. Here we can clearly see large differences between the predicted and original prices. Total bedrooms and population strongly affect the predictions. 6.1 Business outcomes: In line with the business objectives discussed earlier, the outcomes are as follows. Total bedrooms and population have a large effect on the price difference; note that we did not include the income attribute. At first, we thought that the location of a house would affect house price fluctuation, but the results indicate that it has very little effect on the price distribution. Likewise, the age of the property has a negative effect on the price distribution. Some old houses fall in a higher price range than new houses, but the result differs from case to case. Population affects the house price when it combines positively with house age. Conclusion: We addressed all the business problems with our model and implemented all the steps of the CRISP-DM model. However, it was difficult to increase the accuracy of the model because there are some outliers in the data. We also did not include the income attribute in our RapidMiner model; we think it is an important attribute, and including it in further work could increase the accuracy of the model.