A. N. Bharathi et al.; International Journal of Advance Research, Ideas and Innovations in Technology
© 2019, www.IJARIIT.com All Rights Reserved Page | 370
ISSN: 2454-132X
Impact factor: 4.295
(Volume 5, Issue 1)
Available online at: www.ijariit.com
Predicting housing prices using advanced regression techniques
Bharathi A. N.
bharathinandhees1997@gmail.com
KPR Institute of Engineering and
Technology, Coimbatore, Tamil Nadu
Dr. N. Yuvaraj
drnyuvaraj@gmail.com
KPR Institute of Engineering and
Technology, Coimbatore, Tamil Nadu
Dhivya B.
dhivyakrishnan1998@gmail.com
KPR Institute of Engineering and
Technology, Coimbatore, Tamil Nadu
ABSTRACT
House prices increase every year, so there is a need
for a system that predicts future house prices. House
price prediction can help a developer determine the
selling price of a house. It can also help a customer
choose the right time to purchase a house. Several
factors influence the price of a house, including its
physical condition, concept, location and others. House
prices vary from place to place and between communities.
There are various techniques for predicting house prices. One
efficient approach is regression.
Regression is a reliable method of identifying which variables
have an impact on a topic of interest. Random forests are very
accurate and robust to over-fitting. Performing
a regression allows one to determine with confidence which factors
matter the most, which factors can be ignored and how the
factors influence each other. The main objective is to use an
advanced methodology for prediction.
Keywords— House prices, Regression, Price prediction,
Lasso regression
1. INTRODUCTION
Investment is one of the business activities that most people are
interested in during this era of globalization. Several assets
are commonly used for investment, for example gold, stocks
and property [1]. In determining the price of a home, the
developer must carefully calculate and choose an
appropriate method, because property prices tend to rise
continuously and almost never fall in the long or short term [2].
Prediction analysis is one of several approaches that
can be used to determine the price of a house. The challenge
is to get the model's result as close as possible to the actual price. A
specific house price is determined by location, size, house
type, city, country, tax rules, economic cycle, population
movement, interest rate, and many other factors that could
affect demand and supply. For local house price prediction,
there are many useful regression algorithms. Regression
analysis is a set of statistical processes for estimating the
relationships among variables. It includes many techniques
for modeling and analyzing several variables when the focus is
on the relationship between a dependent variable and one or
more independent variables (or 'predictors').
Regression analysis, more specifically, helps one understand
how the typical value of the dependent variable changes when
any one of the independent variables is varied, while the other
independent variables are held fixed. One of the main
advantages of regression-based predicting techniques is that
they use research and analysis to predict what is likely to
happen in the next quarter, year or even farther into the future.
For small-business owners, regression-based forecasting can
provide insight into how higher taxes, changes in consumer
spending or shifts in the local economy will affect the business.
Regression and forecasting techniques can lend a scientific
angle to managing small businesses, reducing large amounts of
raw data to actionable information. The dataset used has a
training set of 1460 houses (i.e., observations),
each with 79 attributes (i.e., features, variables, or
predictors) and the sale price of the house. The testing
set includes 1459 houses with the same 79 attributes, but
the sale price is not included as this is our target
variable. In this paper, the proposed house price prediction is
based on the random forest algorithm.
2. LITERATURE SURVEY
A study [3] examined housing prices in the City of
Savannah, Georgia, using the hedonic pricing model. The
paper’s data contains 2,888 single-family houses for the period
between 2000 and 2005. It shows that the log price of houses is
positively and significantly correlated with the number of
bathrooms, bedrooms, fireplaces, garage spaces, stories and the
total square feet of the house. Additionally, the paper adds three
dummy variables, May, June, and July, to account for
seasonal effects on house prices. If the
house is sold in May, the variable May is set to 1,
and to 0 otherwise. The other variables, June and July, are
constructed in a similar fashion. The paper finds that the log
sale prices of houses are significantly and positively correlated
with May and July, while June is insignificant. This implies that
houses whose sales close in May or July tend to have a higher
price.
Another study examined the social and economic impact of housing in the Scottish
countryside. Investment in housing finance
affects the economy both directly and indirectly. Employment,
GDP, productivity and many other important factors are
impacted by housing finance investment. The study revealed
that housing is an important indicator for increasing the wealth
of nations. It concluded that the Scottish housing
policy objective is to improve the quality standard of housing
as well as to increase investment in the household sector.
In research [8] it was found that, at a significance level of
0.05, all 5 variables in a regression model (Floor, Heating
System, Earthquake Zone, Rental Value and Land Value) have
a significant impact on the dependent variable Value. Land
value and rental value have the highest impact on housing
price, followed by floor, heating system and earthquake zone.
Although the other variables were not found to be
significant in the study, this can change with
the sample size; if the sample size increases, re-estimating the regression
model is recommended for further studies. The
application of multiple regression analysis to a house data set
explains, or models, the variation in house price, and
is a good example of the strategic application of a
mathematical tool to aid analysis, and hence decision making, in
property investment. Another study [5] (2010) uses support vector machine
(SVM) regression to forecast housing prices in China
between 1993 and 2002 and in a certain district of Tangshan
city between 2000 and 2002. The paper uses a genetic
algorithm to tune the hyper-parameters of the SVM regression
model. The error scores of the SVM regression model for both
China and the Tangshan city district are lower than 4%.
This indicates that the SVM regression model performs well in
forecasting housing prices in China. For Singapore’s housing
market, a decision tree model was used (2006) to study the effects of housing
characteristics on prices [6]. The paper concludes that
the owners of 2-room to 4-room flats are more concerned with
the flats’ basic characteristics, such as model type and age,
than the owners of 5-or-more-room flats. Moreover, owners of
executive flats care more about service characteristics, such
as neighbourhood location and recreational facilities, than about
basic housing characteristics.
In a 2014 study [7], the relationship between
various home characteristics and the asking price of a
residential property was analyzed using both simple linear
regression and multiple linear regression, each fitted by
ordinary least squares. Home square footage was used as the
explanatory variable in the simple linear regression, and the
multiple linear regression added land size,
number of bedrooms, year of construction, and other
explanatory variables. The multiple linear regression results
revealed the bias caused by omitting crucial factors from the
simple linear regression. Home square footage was found to be
the most important factor in determining
residential property price, while garage capacity proved to be
the weakest factor.
Many previous studies find empirical evidence supporting the
significant interrelations between house price and various
economic variables, such as income, interest rates, construction
costs and labor market variables [8][9][10].
3. METHODS AND MATERIALS
There are various kinds of regression techniques available for
making predictions [11]. The techniques are mostly driven by
three metrics (number of independent variables, type of
dependent variable and shape of the regression line), which are
shown in Fig. 1.
Various algorithms used for predicting housing
prices are listed below.
Fig. 1: metrics of regression
3.1. Hedonic Pricing Model
Hedonic price theory assumes that a commodity such as a
house can be viewed as an aggregation of individual
components or attributes [12]. It is frequently used to measure a
property’s price. The hedonic pricing model combines both the
internal characteristics of a house (such as the number of
bedrooms, number of bathrooms, etc.) and its external
characteristics (such as the neighbourhood’s walkability score,
public schools’ scores, etc.) to estimate its value. Hedonic
pricing can be implemented using regression models.
Equation 1 shows the regression model used to determine a
price.
$$y = a \cdot x_1 + b \cdot x_2 + \cdots + n \cdot x_i \tag{1}$$
where $y$ is the predicted price and $x_1, x_2, \ldots, x_i$ are the attributes
of a house, while $a, b, \ldots, n$ are the regression coefficients
of each variable in the determination of house prices. While the
hedonic technique is an acceptable method for accommodating
attribute differences of a house price determination model, it is
generally unrealistic to deal with the housing market in any
geographical area as a single unit. Therefore, it seems more
reasonable to introduce geographical information or location
factor into a model that allows shifts in the house price level.
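As a minimal sketch of Equation 1 with a location factor, the following Python snippet evaluates a hedonic price as a weighted sum of attributes plus a per-location shift in price level. The attribute names, coefficient values and location shifts are hypothetical and chosen only for illustration; in practice they would be estimated by regression.

```python
# Illustrative hedonic model: price = sum(coefficient * attribute) + location shift.
# All names and numbers below are made-up assumptions, not estimates from the paper.

def hedonic_price(attributes, coefficients, location_shift=0.0):
    """Evaluate Equation 1 and add an optional per-location shift in price level."""
    missing = set(coefficients) - set(attributes)
    if missing:
        raise KeyError(f"missing attributes: {sorted(missing)}")
    base = sum(coefficients[name] * attributes[name] for name in coefficients)
    return base + location_shift

# Hypothetical coefficients (price contribution per unit of each attribute).
coefficients = {"bedrooms": 15000.0, "bathrooms": 10000.0, "area_sqft": 100.0}
# Hypothetical location factor: a constant shift of the price level per area.
location_shifts = {"suburb": 0.0, "city_centre": 50000.0}

house = {"bedrooms": 3, "bathrooms": 2, "area_sqft": 1500}
price_suburb = hedonic_price(house, coefficients, location_shifts["suburb"])
price_centre = hedonic_price(house, coefficients, location_shifts["city_centre"])
print(price_suburb, price_centre)
```

The same house receives a different price in each location only through the shift term, which is one simple way to avoid treating the whole market as a single unit.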
3.2. Artificial Neural Network Model
The use of the neural network model is similar to the process
utilized in building the hedonic price model. However, the
neural network [13] must first be trained from a set of data. For
a particular input, the output (estimated house price) is
produced from the model. Then, the model compares the model
output to the actual output (actual house price). The accuracy of
the value is determined by the total mean square error and then
backpropagation is used in an attempt to reduce prediction
errors, which is done through the adjusting of the connection
weights. The performance [14] of the network can be
influenced by the number of hidden layers and the number of
nodes that are included in each hidden layer. A trial-and-error
process is applied to find the optimal artificial neural
network model. It is far more complicated than many other models,
such as decision trees and regression, and its weights are hard to
interpret and understand.
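The training loop described above (forward pass, mean square error, backpropagation, weight adjustment) can be sketched in a few lines of dependency-free Python. This is an illustrative toy under stated assumptions, not the network used in any cited study: it has one input feature, one hidden layer of tanh units and a linear output, and the function name, layer size, learning rate and rescaled data are all hypothetical.

```python
import math
import random

def train_tiny_network(xs, ys, hidden=3, lr=0.05, epochs=500, seed=0):
    """Train a 1-input, one-hidden-layer tanh network by full-batch gradient descent."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # input -> hidden weights
    b1 = [0.0] * hidden                                   # hidden biases
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # hidden -> output weights
    b2 = 0.0                                              # output bias
    n = len(xs)
    history = []  # mean squared error measured before each weight update
    for _ in range(epochs):
        g_w1, g_b1, g_w2, g_b2 = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden, 0.0
        loss = 0.0
        for x, y in zip(xs, ys):
            # Forward pass: estimated price for this input.
            h = [math.tanh(w1[k] * x + b1[k]) for k in range(hidden)]
            pred = sum(w2[k] * h[k] for k in range(hidden)) + b2
            err = pred - y
            loss += err * err / n
            # Backward pass: accumulate gradients of the mean squared error.
            d_pred = 2.0 * err / n
            g_b2 += d_pred
            for k in range(hidden):
                g_w2[k] += d_pred * h[k]
                d_pre = d_pred * w2[k] * (1.0 - h[k] ** 2)
                g_w1[k] += d_pre * x
                g_b1[k] += d_pre
        history.append(loss)
        # Adjust the connection weights against the gradient.
        for k in range(hidden):
            w1[k] -= lr * g_w1[k]
            b1[k] -= lr * g_b1[k]
            w2[k] -= lr * g_w2[k]
        b2 -= lr * g_b2
    return (w1, b1, w2, b2), history

# Hypothetical data: house size and price, both rescaled to the range [0, 1].
sizes = [0.0, 0.25, 0.5, 0.75, 1.0]
prices = [0.1, 0.3, 0.5, 0.7, 0.9]
params, history = train_tiny_network(sizes, prices)
print(f"MSE before training: {history[0]:.4f}, after: {history[-1]:.4f}")
```

Changing `hidden` and rerunning is the trial-and-error search over network size mentioned above; the learned weights themselves carry no direct interpretation.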
4. PROPOSED METHODOLOGY
4.1. Dataset and Preprocessing
There are two data sets, namely the train dataset and the test
dataset. Both contain numerous variables, or features,
that describe a house. The training dataset contains 1460
observations for which the sale price of a house is provided.
Based on this data, a prediction model is to be built. The test
dataset contains 1459 observations for which the sale price has
to be predicted. The 80 variables in total focus on the quality and
quantity of many physical attributes of the property. Most of
the variables are exactly the type of information that a typical
home buyer would want to know about a potential property.
This study is based on house price data from the Ames
Housing dataset.
Some of these features of the dataset don’t have a linear
relationship with the house price such as ‘date’, ‘long’ and ‘lat’
representing the date the house was sold, the longitude and the
latitude of the house, respectively. These features should either
be removed or modified. First, using ‘date’ (the date the house
was sold) and ‘yr built’ (the year the house was built), we
calculate the age of the building. Using the feature ‘yr
renovated’ (the year the house was renovated) we create a new
binary feature to represent whether the house was renovated at
all. Although zip-code doesn’t have a linear relation with the
price, it could have useful information about the house price.
Hence it is treated as a categorical feature. Next, the features
‘id’, ‘date’, ‘yr built’, ‘lat’, ‘long’, ‘date yr’ and ‘yr renovated’
are removed.
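The feature modifications above can be sketched as a single record-level function. This is a hedged illustration: the underscore spellings (`yr_built`, `yr_renovated`), the `zipcode` key, the date format (a string starting with a four-digit year) and the convention that a renovation year of 0 means "never renovated" are assumptions, and the sample record is invented.

```python
# Columns removed after the derived features have been computed.
# The underscore spellings are assumed; the text writes them with spaces.
DROPPED = {"id", "date", "yr_built", "lat", "long", "yr_renovated"}

def preprocess(record):
    """Derive building age and a renovated flag, then drop the raw columns."""
    sale_year = int(record["date"][:4])  # assumes the date string starts with YYYY
    out = {key: value for key, value in record.items() if key not in DROPPED}
    out["age"] = sale_year - record["yr_built"]
    # Assumption: yr_renovated == 0 encodes "never renovated".
    out["renovated"] = 1 if record["yr_renovated"] > 0 else 0
    # Zip code is kept as a categorical label, not a number.
    out["zipcode"] = str(record["zipcode"])
    return out

# Invented example record.
raw = {
    "id": 1, "date": "20141013", "yr_built": 1955, "yr_renovated": 0,
    "zipcode": 98178, "lat": 47.5, "long": -122.2,
    "sqft_living": 1180, "price": 221900,
}
clean = preprocess(raw)
print(clean)
```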
4.2. Lasso Regression
In machine learning and statistics, lasso (least absolute
shrinkage and selection operator; also Lasso or LASSO) is
a regression analysis method that performs both variable
selection and regularization in order to enhance the prediction
accuracy and interpretability of the statistical model it
produces.
Lasso is a powerful regression technique. It works by
penalizing the magnitude of the feature coefficients while
minimizing the error between predicted and actual
observations. Lasso is also known as the L1 regularization technique.
The algorithm can be implemented with the help of Python’s
scikit-learn library [15]. Lasso attempts to minimize a cost
function, given as Cost(W) = RSS(W) + α
(sum of absolute values of the weights). Here RSS refers to the ‘Residual Sum
of Squares’, meaning the sum of the squared errors between
the predicted and actual values in the training data set. α is a
coefficient that takes various values. There are three cases for
the value of α.
1. α = 0; same coefficients as simple linear regression
2. α = ∞; all coefficients zero
3. 0 < α < ∞; coefficients between 0 and those of simple linear
regression
The Lasso cost function can be written as
$$\mathrm{Cost}(w) = \sum_{i=1}^{N} \Big\{ y_i - \sum_{j=0}^{M} w_j x_{ij} \Big\}^2 + \alpha \sum_{j=0}^{M} |w_j|$$
The model can solve many of the challenges that we face with
linear regression and can be a very useful tool for fitting linear
models. It’s a better way to analyze data and capture
relationships in the data and avoid over-fitting.
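To make the three cases for α concrete, the sketch below minimizes the Lasso cost (residual sum of squares plus α times the sum of absolute weights) by cyclic coordinate descent with soft-thresholding. It is a dependency-free illustration rather than the scikit-learn implementation: the function names and the toy data are hypothetical, and no separate intercept is fitted.

```python
def soft_threshold(rho, t):
    """Shrink rho towards zero by t; values within [-t, t] become exactly zero."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_cost(X, y, w, alpha):
    """RSS(w) + alpha * sum(|w_j|)."""
    rss = sum((yi - sum(wj * xij for wj, xij in zip(w, row))) ** 2
              for row, yi in zip(X, y))
    return rss + alpha * sum(abs(wj) for wj in w)

def lasso_coordinate_descent(X, y, alpha, sweeps=200):
    """Minimize the Lasso cost one coefficient at a time."""
    m = len(X[0])
    w = [0.0] * m
    for _ in range(sweeps):
        for j in range(m):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for row, yi in zip(X, y):
                partial = sum(w[k] * row[k] for k in range(m) if k != j)
                rho += row[j] * (yi - partial)
                z += row[j] ** 2
            # Exact one-dimensional minimizer of the cost in w_j.
            w[j] = soft_threshold(rho, alpha / 2.0) / z if z > 0 else 0.0
    return w

# Toy data: y depends only on the first feature (y = 3 * x1).
X = [[1.0, 0.5], [2.0, -1.0], [3.0, 0.5], [4.0, -0.5]]
y = [3.0, 6.0, 9.0, 12.0]

w_ols = lasso_coordinate_descent(X, y, alpha=0.0)     # case 1: plain least squares
w_mid = lasso_coordinate_descent(X, y, alpha=20.0)    # case 3: shrunken coefficients
w_zero = lasso_coordinate_descent(X, y, alpha=200.0)  # large alpha: all zero
print(w_ols, w_mid, w_zero)
```

On this toy data a finite but large α already drives every coefficient to exactly zero, which is how Lasso performs variable selection. Library implementations may scale the two terms differently, so α values are not directly comparable across tools.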
4.3. House Price Affecting Factors
Several factors affect house prices. In research
[16] the factors affecting the house price are divided into three
main groups: physical condition, concept and location.
Physical conditions are properties of a house that can
be observed by the human senses, including the size of the house,
the number of bedrooms, the availability of a kitchen and garage,
the availability of a garden, the area of land and buildings,
and the age of the house [17]. The concept is an idea
offered by developers to attract potential buyers, for
example a minimalist home, a healthy and green
environment, or an elite environment. Location is an important
factor in shaping the price of a house, because the
location determines the prevailing land price [18]. In addition,
the location determines the ease of access to public
facilities, such as schools, campuses, hospitals and health centres,
as well as family recreation facilities such as malls and culinary
tours, and may even offer beautiful scenery [19], [20].
4.4. XGBoost
XGBoost has become a widely used and really popular tool
among Kaggle competitors and Data Scientists in industry, as it
has been battle tested for production on large-scale problems. It
is a highly flexible and versatile tool that can work through
most regression, classification and ranking problems as well as
user-built objective functions. As open-source software, it is
easy to access and may be used through different platforms
and interfaces. The portability and compatibility of the system
permit its use on Windows, Linux and OS X. It also
supports training on distributed cloud platforms such as AWS,
Azure and GCE, among others, and it connects easily to large-scale
cloud dataflow systems such as Flink and Spark.
Although it was built and initially used in the Command Line
Interface (CLI) by its creator, it can also be loaded and used in
various languages and interfaces such as Python, C++, R, Julia,
Scala and Java.
XGBoost is an accurate and scalable implementation of
gradient boosting machines. Its name stands for eXtreme
Gradient Boosting; it was developed by Tianqi Chen and now it
is part of a wider collection of open-source libraries developed
by the Distributed Machine Learning Community (DMLC). It
has proven to push the limits of computing power for boosted
tree algorithms, as it was built and developed for the sole
purpose of computational speed and model performance.
Specifically, it was engineered to exploit every bit of memory
and hardware resources for tree boosting algorithms.
The implementation of XGBoost offers several advanced
features for tuning of models, computing environments and
algorithm enhancement. It is capable of performing the three
main forms of gradient boosting (Gradient Boosting
(GB), Stochastic GB and Regularized GB) and it is robust
enough to support fine-tuning and the addition of regularization
parameters. According to Tianqi Chen, the latter is what makes
it superior and different from other libraries. System-wise, the
library’s portability and flexibility allow the use of a wide
variety of computing environments like parallelization for tree
construction across several CPU cores; Out-of-Core computing;
distributed computing for large models; and Cache
Optimization to improve hardware usage and efficiency.
The algorithm was developed to efficiently reduce computing
time and allocate an optimal usage of memory resources.
Important features of implementation include handling of
missing values (Sparse Aware), Block Structure to support
parallelization in tree construction and the ability to fit and
boost on new data added to a trained model. The prediction
method itself involves several steps, which are described in Section 5.
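The core idea behind gradient boosting machines, fitting each new tree to the residuals of the current ensemble, can be illustrated without any library. The sketch below is plain gradient boosting with one-split regression stumps and squared error; it is not the XGBoost library and omits its regularization, sparsity handling and parallel tree construction. The function names, learning rate and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on one feature that best fits the residuals."""
    best = None
    levels = sorted(set(xs))
    for low, high in zip(levels, levels[1:]):
        threshold = (low + high) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict(model, x):
    """Base prediction plus the shrunken contribution of every stump."""
    total = model["base"]
    for threshold, left_value, right_value in model["stumps"]:
        total += model["rate"] * (left_value if x <= threshold else right_value)
    return total

def boost(xs, ys, rounds=50, rate=0.1):
    """Gradient boosting for squared error: each stump is fitted to the residuals."""
    model = {"base": sum(ys) / len(ys), "rate": rate, "stumps": []}
    for _ in range(rounds):
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        model["stumps"].append(fit_stump(xs, residuals))
    return model

def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical data: living area in square feet and price in thousands.
areas = [1000, 1500, 2000, 2500, 3000]
prices = [150.0, 200.0, 260.0, 300.0, 360.0]

baseline = {"base": sum(prices) / len(prices), "rate": 0.1, "stumps": []}
model = boost(areas, prices)
print(mse(baseline, areas, prices), mse(model, areas, prices))
```

The small `rate` plays the role of shrinkage: each stump corrects only a fraction of the remaining error, so many weak learners are combined gradually.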
5. WORKING MODEL
Fig. 2: Steps involved for prediction
a) Reading data: At this stage, the data is read. The training
data then needs to be concatenated with the test data. This is
done mainly because of the presence of text variables, which
will later be replaced by dummy variables. If the training and test
sets were treated separately, they could end up with different numbers
of dummy variables, which would in turn
damage the prediction.
b) Data Preprocessing: This is the process of transforming raw,
complex data into systematic, understandable knowledge. It
involves finding missing and redundant data
in the dataset. The entire dataset is checked for NA values, and
any observation containing NA is deleted. This
brings uniformity to the dataset. Finally, the data has to be split
into training and test data.
c) Data Analysis: Before applying any model to our dataset,
we need to find out the characteristics of our dataset. Thus, we
need to analyze our dataset and study the different parameters
and relationship between these parameters. We can also find
out the outliers present in our dataset. Outliers often arise from
measurement or data-entry errors, and such points need to be excluded
from the dataset.
d) Feature Engineering: Feature (variable or predictor)
engineering is one of the most important steps in model
creation. Often there is valuable information “hidden” in the
predictors that is only revealed when these features are manipulated
in some way. Below are some examples of such
features:
Remodeled (categorical): Yes or No if Year Built is
different from Year Remodeled; if the year the house was
remodeled is different from the year it was built, the
remodeling likely increases property value.
Seasonality (categorical): Combined Month Sold with Year
Sold; while more houses were sold during summer months,
this likely varies across years, especially during the time
period these houses were sold, which coincides with the
housing crash.
New House (categorical): Yes or No if Year Sold is equal
to Year Built; if a house was sold the same year it was
built, we might expect it was in high demand and might
have a higher Sale Price.
Total Area (continuous): Sum of all variables that describe
the area of different sections of a house; There are many
variables that pertain to the square footage of different
aspects of each house; we might expect that the total
square footage has a strong influence on Sale Price.
e) Modelling: Model selection is the process of combining data
and prior information to select among a group of statistical
models. In building a model, decisions to include or exclude
covariates, as well as uncertainty in how to code the covariates
in the design matrix for any given model, are based both on the
prior hypotheses and the data. Lasso (least absolute shrinkage
and selection operator; also Lasso or LASSO) is a regression
analysis method that performs both variable
selection and regularization in order to enhance the prediction
accuracy and interpretability of the statistical model it
produces.
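Steps a), b) and d) above can be sketched together in plain Python. The column names (`YearRemodeled`, `FirstFloorArea` and so on), the `source` tag used to split the data again, and the sample rows are hypothetical and do not reproduce the actual Ames column names; missing values are represented by `None`.

```python
def drop_missing(rows):
    """Step b): delete every observation that contains a missing value."""
    return [row for row in rows if all(value is not None for value in row.values())]

def add_features(row):
    """Step d): derive Remodeled, NewHouse and TotalArea from existing columns."""
    out = dict(row)
    out["Remodeled"] = "Yes" if row["YearRemodeled"] != row["YearBuilt"] else "No"
    out["NewHouse"] = "Yes" if row["YearSold"] == row["YearBuilt"] else "No"
    out["TotalArea"] = (row["FirstFloorArea"] + row["SecondFloorArea"]
                        + row["BasementArea"])
    return out

def one_hot(rows, column):
    """Replace a text column by one dummy variable per level seen in ALL rows."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        out = {key: value for key, value in row.items() if key != column}
        for level in levels:
            out[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(out)
    return encoded

train = [
    {"Neighborhood": "NAmes", "YearBuilt": 1960, "YearRemodeled": 1960, "YearSold": 2008,
     "FirstFloorArea": 900, "SecondFloorArea": 0, "BasementArea": 800},
    {"Neighborhood": "CollgCr", "YearBuilt": 2007, "YearRemodeled": 2007, "YearSold": 2007,
     "FirstFloorArea": 850, "SecondFloorArea": 850, "BasementArea": 850},
    {"Neighborhood": "NAmes", "YearBuilt": 1975, "YearRemodeled": 1990, "YearSold": 2009,
     "FirstFloorArea": 1000, "SecondFloorArea": 0, "BasementArea": None},  # has NA
]
test = [
    {"Neighborhood": "OldTown", "YearBuilt": 1920, "YearRemodeled": 1995, "YearSold": 2009,
     "FirstFloorArea": 700, "SecondFloorArea": 500, "BasementArea": 600},
]

# Step a): concatenate so that both sets share one set of dummy variables.
combined = [dict(row, source="train") for row in train]
combined += [dict(row, source="test") for row in test]
combined = drop_missing(combined)
combined = [add_features(row) for row in combined]
for column in ("Neighborhood", "Remodeled", "NewHouse"):
    combined = one_hot(combined, column)

# Split back into training and test data, dropping the helper tag.
train_ready = [{k: v for k, v in row.items() if k != "source"}
               for row in combined if row["source"] == "train"]
test_ready = [{k: v for k, v in row.items() if k != "source"}
              for row in combined if row["source"] == "test"]
print(sorted(train_ready[0]))
```

Because "OldTown" appears only in the test rows, encoding the two sets separately would give them different columns; encoding the concatenated data keeps the columns aligned.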
6. CONCLUSION
In this paper, the LASSO regression technique was
implemented to predict the price of a house. The step-by-step
procedure to analyze the dataset and find the correlation
between the parameters is described. We can thus select the
parameters that are not correlated with each other and are
independent in nature; this feature set was then given as
input to the model. Lasso performs both variable selection and regularization
in order to enhance the prediction accuracy.
7. REFERENCES
[1] R. M. A. van der Schaar, “Analysis of Indonesian Property
Market; Overview and Foreign Ownership,” Investment
Indonesian. 2015.
[2] Y. Feng and K. Jones, “Comparing multilevel modelling
and artificial neural networks in house price prediction,”
2015 2nd IEEE Int. Conf. Spat. Data Min. Geogr. Knowl.
Serv., pp. 108–114, 2015.
[3] Richard J. Cebula. “The Hedonic Pricing Model Applied
to the Housing Market of the City of Savannah and Its
Savannah Historic Landmark District”. In: The Review of
Regional Studies 39.1 (2009), pp. 9–22.
[4] Gang-Zhi Fan, Seow Eng Ong, and Hian Chye Koh.
“Determinants of House Price: A Decision Tree
Approach”. In: Urban Studies 43.12 (2006).
[5] Gu Jirong, Zhu Mingcang, and Jiang Liuguangyan.
“Housing price based on genetic algorithm and support
vector machine”. In: Expert Systems with Applications 38
(2011), pp. 3383–3386.
[6] Eric Slone, Haitian Sun, Po-Hsiang Wang, (2014), “Market
Prices of Houses in Atlanta”, from
https://smartech.gatech.edu/bitstream/handle/1853/51632/
Market%20Prices%20of%20Houses%20in%20Atlanta.pdf
[7] P. Linneman, “An empirical test of the efficiency of the
housing market,” Journal of Urban Economics 20 (1986):
140-154.
[8] J. M. Quigley, “Real estate prices and economic cycles,”
International Real Estate Review 2: 1-20, 1999.
[9] K. Tsatsaronis and H. Zhu, “What drives housing price
dynamics: Cross-country evidence?” BIS Quarterly Review,
March.
[10] Torgo, Luis, and Joao Gama. “Regression using
classification algorithms.” Intelligent Data Analysis 1.4
(1997): 275-2.
[11] Ezgi Candas, Seda Bagdatli Kalkan and Tahsin
Yomralioglu, (2015), “Determining the Factors Affecting
Housing Prices”, FIG Working Week 2015, Sofia,
Bulgaria, 17-21 May 2015.
[12] Razi, Muhammad A., and Kuriakose Athappilly. “A
comparative predictive analysis of neural networks (NNs),
nonlinear regression and classification and regression tree
(CART) models.” Expert Systems with Applications 29.1
(2005): 65-74.
[13] Lenk M. M., Worzala E. M. and A. Silva, 1997, “High-tech
Valuation: Should Artificial Neural Networks Bypass
The Human Valuer?”, Journal of Property Valuation &
Investment, 15(1): 8-26.
[14] Pedregosa, Fabian, et al. “Scikit-learn: Machine learning
in Python.” Journal of Machine Learning Research 12.Oct
(2011): 2825-2830.
[15] R. A. Rahadi, S. K. Wiryono, D. P. Koesrindartotoor, and
I. B. Syamwil, “Factors influencing the price of housing in
Indonesia,” Int. J. Hous. Mark. Anal., vol. 8, no. 2, pp.
169–188, 2015.
[16] V. Limsombunchai, “House price prediction: Hedonic price
model vs. artificial neural network,” Am. J. …, 2004.
[17] D. X. Zhu and K. L. Wei, “The Land Prices and Housing
Prices Empirical Research Based on Panel Data of 11
Provinces and Municipalities in Eastern China,” Int. Conf.
Manag. Sci. Eng., no. 2009, pp. 2118–2123, 2013.
[18] S. Kisilevich, D. Keim, and L. Rokach, “A GIS-based
decision support system for hotel room rate estimation and
temporal price prediction: The hotel brokers’ context,”
Decis. Support Syst., vol. 54, no. 2, pp. 1119–1133, 2013.
[19] C. Y. Jim and W. Y. Chen, “Value of scenic views:
Hedonic assessment of private housing in Hong Kong,”
Landsc. Urban Plan., vol. 91, no. 4, pp. 226–234, 2009.