• Used and demonstrated the CRISP-DM methodology throughout the project.
• Used the RapidMiner tool to automatically evaluate the available attributes and operators and produce the prediction.
• Used different algorithms, such as Decision Tree, Random Forest, and Gradient Boosted Tree, to predict the price distribution, and created a simulation of the result.
Prediction of housing price
1. DATA MINING CA-02 PAGE 2
Contents
Abstract
Introduction
The CRISP-DM model
1 Business understanding
1.1 Business objectives
1.2 Stakeholders for this project
1.3 Benefits
1.4 Business constraints
2 Data understanding
2.1 About the dataset
2.2 EDA
3 Data preparation
3.1 Replace missing value
3.2 Feature selection
3.3 Data splitting
4 Modeling
4.1 Auto model
5 Evaluation
6 Deployment
6.1 Business outcomes
Conclusion
IMAGES
[Reference model]
[Attribute of dataset]
[Data types & column name]
[Missing value]
[Select attribute]
[Data splitting]
[Model]
[Auto model]
[Gradient boosted tree]
[Evaluation]
[Evaluation python]
[Result of prediction]
Buying or building the perfect house is a lifetime goal for many people. However, many buyers make mistakes when purchasing a property, and some housing agents give their clients wrong information about the price. As a result, many people end up paying a high price for a property that is worth far less.
The aim of this project is to predict housing prices and draw insights about the price distribution in the California housing dataset. By the end of the project we will know how the price range is distributed and which factors it depends on. This can help both clients and real estate agents settle on the right price.
PROJECT REPORT
Abstract:
A place to live is a basic need of every individual in the world, human or animal. However, many people are homeless because they cannot afford a house. California, known as the land of golden dreams, has become a housing nightmare for many people, with a median house price of about $600,000 for two people. Likewise, Ireland's housing crisis is at its worst in Dublin, the country's leading tech hub. As house prices rise dramatically, the crisis is felt around the world. What is behind this problem? Many people criticize governments for failing to keep house prices in check. We propose a different way to approach the problem: predicting the price range distribution, which can help both customers and real estate agents. We use the CRISP-DM model and the RapidMiner tool to predict the price distribution.
Introduction:
As the world's population grows, people looking to buy a new house are careful to stay within their budget, and house agents need a better market strategy. Because house prices increase dramatically every year, there should be a system that predicts the price of a new house from what people demand in terms of house size, number of bedrooms, and location. This could help real estate agents decide house prices for their clients. Several methods have been proposed to determine the house price range.
Traditional housing price prediction compared the cost price of a house with its selling price. This method has gradually failed to meet business standards. Today, several methodologies are used to predict prices, and data mining has proven to be a reliable one for this kind of project. We use the CRISP-DM methodology, which comprises several steps designed to match business needs and can give a reliable prediction model.
THE CRISP-DM Model:
• CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is a standard process model that divides a data mining project into six phases. It is a well-proven methodology that can give better results for our model.
• We apply this methodology in practice in this project, which covers all six phases. The six phases of the CRISP-DM cycle are presented below for reference.
Fig.1[Six Phases Of CRISP-DM]
[Fig.2 Reference Model Description]
1. Business Understanding:
This section describes the business problems and ideas behind this project. This step helps us understand the requirements so that we can analyze the data with the right tools and techniques.
1.1 Business Objective:
The business objectives of this project are described below:
• California is one of the fastest-growing technology hubs in America, so many people want to live there for career reasons. We therefore determine the price distribution for each area of California and explore how income is distributed among residents.
• The second objective is to determine which factors affect the price distribution. For example, a new house can be in high demand and command a large price increase, so we explore whether house age, or something else, affects the price.
• Location is always an important consideration for buyers. With our proposed model we can measure the impact of location on price. Finally, we predict the new price range based on the important attributes.
1.2 Stakeholders for this project:
Stakeholders are the business leaders or individuals who are affected by the business objectives. By the end of this project, important stakeholders such as real estate agents can improve their market strategy with clients for selling new properties. Likewise, clients are the second group of stakeholders, who get a better picture of the price distribution in different cities. Finally, government bodies are an important stakeholder: they can benefit from this analysis by monitoring how house values are distributed when tackling housing crises.
1.3 Benefits:
This section describes the benefits that may be gained by implementing this project.
• Real estate agents get a realistic price distribution, which can help their business strategy.
• Clients get a fair house value, so they can protect themselves from fraudulent agents asking for an inflated price.
• Government can curb corruption in the housing sector.
• Poor and homeless people can get the right value for a property, which can reduce homelessness in the country.
1.4 Business Constraints:
The model is not generalized, which is a business constraint. We used an old dataset whose prices may not match present house values, and the price distribution is described by a limited number of attributes. It can therefore be difficult to apply this model directly in further business use. We can resolve this problem by using new data with current demand factors.
2. DATA UNDERSTANDING:
Data understanding is an important phase in every data mining project. Understanding the data better helps us choose the right model for the machine learning project. It also gives us a clear picture of the price distribution and its fluctuation, which helps non-technical business leaders understand the project.
2.1 About the dataset:
The data contains information from the 1990 California census and was obtained from Kaggle. It has 10 attributes (columns) related to housing, such as house age, population, location, and number of bedrooms. These attributes are used to predict the price range distribution.
Data source: https://www.kaggle.com/camnugent/california-housing-prices/download
Fig[Attribute description]
2.2 Exploratory Data Analysis:
This step provides a visual overview of the data, such as the data types we have and the important columns to keep for the project. It gives an idea of the whole dataset, which helps in the later data preparation steps. We used a Python Jupyter notebook for a simple visualization of the dataset.
Fig 3[data types] Fig 4[column Name]
From the output above we can see that the dataset has 20,640 rows and 10 attributes. All the attributes are numeric except ocean proximity, the location type, which is categorical. Furthermore, the house value is continuous, so we can apply a regression algorithm. We explore the data further to determine the next steps.
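The inspection described above could be reproduced with pandas roughly as follows. This is a minimal sketch, not the notebook used in the project: the small frame is a made-up stand-in for the Kaggle file, and the column names are assumed to follow that file's headers.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle data; the values are illustrative only.
sample = pd.DataFrame({
    "housing_median_age": [41.0, 21.0, 52.0],
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, None, 190.0],
    "median_income": [8.3252, 8.3014, 7.2574],
    "median_house_value": [452600.0, 358500.0, 352100.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})

# Number of rows and attributes.
n_rows, n_cols = sample.shape

# Separate numeric attributes from categorical ones.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

On the real file, the same calls would be applied to the frame returned by `pd.read_csv` instead of the hand-built sample.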
Fig[House Value VS population]
This scatter plot shows the distribution of house value against population. We can see that only a few house values are very high relative to the population.
Fig[total bedroom VS Median income]
The figure above shows how median income is distributed against total bedrooms. It suggests that the higher the income, the more bedrooms people want. However, only a few points correspond to very high incomes.
Fig 5[ocean proximity]
From the picture above we can see that the ocean proximity column contains only categorical values, which we must convert to numerical values before modeling.
3. DATA PREPARATION:
Data preparation is the third phase of a data mining project and one of the most important. In this phase we prepare the data for the model, which involves cleaning out unusable columns and values that could distort the predicted attribute. The usual data preparation steps are described below.
Clean the missing values:
In this step we clean or fill in the missing values in our dataset to make it more reliable for prediction.
Changing the data types:
If the dataset contains mixed data types, this can cause problems later in the process. In this step we convert all categorical values to numerical values so that the model can use them.
Remove all duplicate values:
If the data used for prediction contains duplicate values, this can bias the prediction, so we must remove all duplicates from the dataset.
Data normalization:
Data normalization is an important step in mining projects for making predictions reliable. However, we judged that the numerical attributes in our dataset did not need rescaling, so we did not apply normalization in our model.
The data preparation steps applied to our dataset are described below.
3.1 Replace Missing values:
We checked for missing values in our dataset in both Python and RapidMiner. The outputs are shown below.
Fig[Missing values]
The picture above shows the missing values found in Python. There are 207 missing values in total in our dataset, all in the total bedrooms column. We replace these missing values with the mean value of that column.
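A minimal pandas sketch of this check and the mean replacement is given here. The four-row frame and its values are hypothetical; only the technique (count missing values, then fill with the column mean) follows the text.

```python
import pandas as pd

# Hypothetical sample with two missing bedroom counts.
df = pd.DataFrame({
    "total_rooms": [800.0, 900.0, 1200.0, 700.0],
    "total_bedrooms": [100.0, None, 300.0, None],
})

# Count missing values per column before cleaning.
missing_before = df.isnull().sum()

# Series.mean() skips missing values, so the mean uses only the known entries.
mean_bedrooms = df["total_bedrooms"].mean()
df["total_bedrooms"] = df["total_bedrooms"].fillna(mean_bedrooms)

# Count again to confirm that nothing is missing any more.
missing_after = df.isnull().sum()

print(missing_before)
print(missing_after)
```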
Fig[missing value Rapidminer]
We also used RapidMiner to replace the missing values in our dataset, filling the missing total bedrooms values with the average of that column. To do this, we chose the Replace Missing Values operator from the RapidMiner operator box, selected the attribute we wanted to clean, and set the replacement mode to average.
After this step we converted the categorical variable in our dataset to numeric. As described previously, the ocean proximity column contains categorical values. We used Python's label encoder to convert its 5 categories to numbers. With the transformation complete, the next step is feature selection, which is described below.
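The encoding step might look like the sketch below, assuming the label encoder mentioned above is scikit-learn's `LabelEncoder`. The sample rows are made up, and the column name `ocean_proximity_code` is introduced here only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows covering five ocean proximity categories.
df = pd.DataFrame({
    "ocean_proximity": [
        "NEAR BAY", "INLAND", "<1H OCEAN", "ISLAND", "NEAR OCEAN", "INLAND",
    ]
})

# LabelEncoder assigns integer codes to the sorted unique labels.
encoder = LabelEncoder()
df["ocean_proximity_code"] = encoder.fit_transform(df["ocean_proximity"])

# Keep the label-to-code mapping so the codes can be interpreted later.
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(df)
print(mapping)
```

Label encoding gives the categories an arbitrary numeric order, which a linear model will treat as meaningful; one-hot encoding is a common alternative when that is a concern.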
3.2 Feature selection:
Feature selection is an important step in mining projects. We choose the important features from the dataset to increase our prediction accuracy. Feature selection can be done by various methods, such as filter methods and wrapper methods. In this project we implemented a filter method based on the correlation between the attributes. A pictorial representation of the feature selection is shown below.
Fig[bivariate, univariate]
The picture above combines bivariate and univariate plots that describe the relationships among the attributes, which helps in choosing the right attributes for modeling. We also used a correlation plot for better visualization, shown below.
Fig[correlation matrix]
After plotting the correlation matrix we can see that the first two columns, longitude and latitude, have either a high or a very low correlation with every other attribute. Apart from these two, the attributes show ordinary levels of correlation with each other. The median house value column shows a high correlation with every attribute, so we choose it as the label column for predicting the price range. Next, we describe the steps we used in RapidMiner for feature selection.
Steps in RapidMiner:
We used the Select Attributes operator in RapidMiner and chose the attributes to keep for the rest of the process according to the correlation plot above. The operator and the selected attributes are shown below.
Fig[select attribute rapidminer]
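The correlation-based filter used in this section could be sketched in pandas as follows. The data and the idea of ranking attributes by absolute correlation with the label are illustrative assumptions; the report does not state a specific threshold or ranking rule.

```python
import pandas as pd

# Hypothetical data: income tracks the house value exactly, age only loosely.
df = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "housing_median_age": [30.0, 10.0, 40.0, 20.0, 50.0],
    "median_house_value": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Pairwise Pearson correlation between all numeric attributes.
corr = df.corr()

# Rank the candidate features by absolute correlation with the label column.
target_corr = (
    corr["median_house_value"]
    .drop("median_house_value")
    .abs()
    .sort_values(ascending=False)
)

print(corr.round(2))
print(target_corr)
```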
3.3 Data Splitting:
In this step we split our final dataset into training and test sets. We split the dataset in both RapidMiner and Python. As shown below, we split the data 70%/30% in RapidMiner and 80%/20% in Python.
Fig[data split python]
Fig[data split in rapid miner]
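For reference, both split proportions can be reproduced in Python with scikit-learn's `train_test_split`. This is a sketch on placeholder arrays; the `random_state` value is an arbitrary choice made here for repeatability, not a setting taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (100 rows, 2 columns) and target vector.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# 80% training / 20% test, the proportion used in Python.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 70% training / 30% test, the proportion used in RapidMiner.
X_train70, X_test30, y_train70, y_test30 = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test))
print(len(X_train70), len(X_test30))
```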
4. Modeling:
In the modeling phase we implemented linear regression. As mentioned previously, median house value is continuous. Linear regression predicts a target variable y from the other, independent variables, which we set as x. In this dataset the target variable y is the median house value, and the remaining selected attributes are the independent variables x. We built linear regression models in both RapidMiner and Python. In addition, RapidMiner's Auto Model showed that the gradient boosted tree had the highest accuracy with the minimum run time, so we chose the gradient boosted tree in Auto Model. The process is shown below.
Fig[model python]
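A minimal sketch of fitting a linear regression with scikit-learn is shown here, assuming that is the library behind the Python model. The data is synthetic and built from a known linear rule so the fitted coefficients can be checked; it is not the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic independent variables x (two columns) and a target y that
# follows y = 3*x1 + 2*x2 + 5 exactly.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 5.0

# Fit the model: it estimates one coefficient per column plus an intercept.
model = LinearRegression()
model.fit(X, y)

# Predict the target for a new, unseen row.
pred = model.predict(np.array([[6.0, 5.0]]))

print(model.coef_, model.intercept_)
print(pred)
```

In the project setting, `X` would hold the selected attributes of the training set and `y` the median house value.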
Fig[model Rapidminer]
The picture above shows how we built the model in RapidMiner. The operators we used to build the model are described below.
First, we loaded the dataset in RapidMiner using the Retrieve operator. We then examined the missing values with the help of the data statistics view and set the Replace Missing Values operator to replace each missing value with the attribute's average. After this we used the Correlation Matrix operator to show the correlation coefficients between the attributes. Next, we set the role of the median house value attribute to label, that is, the target column, and split the data with the Split Data operator. Lastly, we applied the Linear Regression model and fitted it to the data.
4.1 Auto model:
Auto Model tried various models. Of these, the gradient boosted tree gave reliable accuracy and was less time-consuming, so we finally selected it. The result of the GBT is shown below.
Fig[Auto model]
Fig[Gradient Boosted Tree]
In the picture above we can see that the gradient boosted tree takes total rooms as the root node, with population and total bedrooms as sub-branches of the tree used to predict the price from the data.
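To illustrate how a gradient boosted tree model ranks attributes, the sketch below uses scikit-learn's `GradientBoostingRegressor`. This is a stand-in, not RapidMiner's Auto Model implementation: the data is synthetic, the hyperparameters are arbitrary, and the target is deliberately constructed so that the first feature dominates, so the ranking says nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic features named after attributes from the report (values are random).
rng = np.random.default_rng(0)
feature_names = ["total_rooms", "population", "total_bedrooms"]
X = rng.uniform(0.0, 1.0, size=(200, 3))

# Constructed target: driven mostly by the first column, slightly by the second.
y = 10.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

# Fit a small boosted ensemble of shallow regression trees.
gbt = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
gbt.fit(X, y)

# Importances are normalized to sum to 1; a higher value means the feature
# is used for more effective splits across the ensemble.
importances = dict(zip(feature_names, gbt.feature_importances_))
top_feature = max(importances, key=importances.get)

print(importances)
print(top_feature)
```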
5. Evaluation:
In the evaluation phase we evaluated both our Python and RapidMiner models, using the root mean square error (RMSE) as the metric. After comparing the RMSE of the two models, we found that the RapidMiner model has the higher RMSE, at 90147.99. Both results are shown below.
Fig[Evaluation RapidMiner]
Fig[Evaluation python]
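The root mean square error used in this comparison is the square root of the mean squared difference between actual and predicted values. A minimal NumPy sketch follows; the helper name `rmse` and the sample prices are hypothetical and unrelated to the reported results.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Square the errors, average them, then take the square root.
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical actual and predicted house values.
actual = [200000.0, 300000.0, 400000.0]
predicted = [210000.0, 290000.0, 430000.0]

score = rmse(actual, predicted)
print(score)
```

Because RMSE is expressed in the same unit as the target, a lower value means the predicted prices lie closer to the actual prices on average.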
Finally, we found that some attributes affected our model accuracy. We did not select total bedrooms and income in our RapidMiner model, and it was less accurate than the Python model. This suggests that income and bedroom count affect house value prediction.
6. Deployment:
Fig[prediction result]
After the evaluation phase we deployed our model to compare the predicted prices with the original prices and to find insights from the predictions that can help tackle the business problems. Here we can clearly see that there is a large difference between the predicted and original prices. Total bedrooms and population can strongly affect our prediction.
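A comparison of predicted and original prices of this kind could be tabulated in pandas as sketched below. The prices and the column names `actual`, `predicted`, `difference`, and `abs_pct_error` are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical original and predicted prices for three houses.
results = pd.DataFrame({
    "actual": [200000.0, 300000.0, 400000.0],
    "predicted": [250000.0, 290000.0, 400000.0],
})

# Signed difference: positive means the model overestimates the price.
results["difference"] = results["predicted"] - results["actual"]

# Absolute error as a percentage of the original price.
results["abs_pct_error"] = results["difference"].abs() / results["actual"] * 100

# Show the rows with the largest relative error first.
worst = results.sort_values("abs_pct_error", ascending=False)
print(worst)
```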
6.1 Business outcomes:
Against the business objectives discussed previously, the outcomes are as follows.
Total bedrooms and population have a large effect on the price difference. However, we did not include the income attribute.
At first we thought that the location of a house would strongly affect its price, but from the results we conclude that it has very little effect on the price distribution.
Likewise, the age of the property has a negative effect on the price distribution. Some of the old houses have a higher price range than new houses, but the result differs from case to case. We also see that population affects the house price when it is combined with house age.
Conclusion:
We addressed all the business problems with our model and implemented all the steps of the CRISP-DM model. However, it was difficult to increase the accuracy of the model because there are some outliers in our data. In addition, we did not include the income attribute in our RapidMiner model. We think it is an important attribute, and including it in further work could increase the accuracy of our model.