This document summarizes a project analyzing trends in Jersey City condo prices. It explores using tax assessment and third-party data to build models predicting condo price per square foot. The predictors are square footage, floor, view, a building score, and a price index. Linear regression, a model tree, GBM, and random forest are tested. Random forest had the lowest RMSE on the full data set (64.49), but the model tree had the best cross-validation RMSE (78.46, versus 82.19 for random forest). Future work could include more rigorous modeling and a comparison with other price estimates.
THE KEY FACTORS THAT IMPACT JERSEY CITY CONDO PRICES
1. THE HACK ON JERSEY CITY CONDO PRICES
explore trends in public data
Yiqun “Yi” Wang / NYC Data Science Academy / Code for JC / March 2015
Outline of the Project
• Data
  - Tax assessment data
  - Third-party data sources to join
  - Data cleaning and collection work
• Relationship Exploration
  - Building attributes exploration
  - Individual unit price exploration
• Model for Prices
  - 5 competing models
  - Cross-validation
Tax Assessment Data
NJ MOD IV System
Covers all individual properties
Downloadable in batch in text files
Key columns:
- Property address
- Property class
- Property size
- Year built
- Owner address
- Owner name
- Last sold price
- Last sold date
- Qualifier (can parse out condo floor # and unit #)
Tax Assessment Data
• 62,270 total property records, as of Feb 2015
-- filter down to --
• 3,867 individual condo units (of 19 selected mid/high-rise buildings)
# step 1.1 tax data load from NJ MOD IV system
url <- "http://tax1.co.monmouth.nj.us/download/0906monm204610.zip"
download.file(url,"0906monm204610.zip",quiet = FALSE)
closeAllConnections()
unzip("0906monm204610.zip")
taxdata <- read.csv(file="0906monm204610.csv")
Tax Assessment Data is Dirty!
# address cleanup
taxdata$Property.Location <- gsub("STREET", "ST", taxdata$Property.Location)
taxdata$Property.Location <- gsub("BOULEVARD", "BLVD", taxdata$Property.Location)
taxdata$Property.Location <- gsub("[[:punct:]]", "", taxdata$Property.Location)
taxdata$Property.Location <- gsub("[[:space:]]", "", taxdata$Property.Location)
# drop bad prices, bad units, bad sf, bad records, retail condos
# (is.na() is the element-wise missing-value check; is.null() on a column is always a single FALSE)
taxdata <- taxdata[taxdata$Sale.Price > 10000, ]
taxdata <- taxdata[taxdata$Sale.Price <= 10000000, ]
taxdata <- taxdata[!(taxdata$Qual == "" | is.na(taxdata$Qual)), ]
taxdata <- taxdata[!(taxdata$Sq..Ft. == "" | is.na(taxdata$Sq..Ft.) |
                     taxdata$Sq..Ft. <= 400 | taxdata$Sq..Ft. >= 3000), ]
taxdata <- taxdata[!(is.na(taxdata$Map.Page)), ]
taxdata <- taxdata[!(taxdata$Building.Class == "C"), ]
taxdata <- taxdata[!(substr(taxdata$Qual, 4, 4) %in% c("R", "L", "U")), ]
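The address normalization above can be wrapped in a small helper so the same rules are applied to every source that is joined on address. This is a minimal sketch: `clean_address` and the second sample address are illustrative assumptions, not part of the original script.

```r
# Hypothetical helper: applies the same four substitutions as the cleanup step
clean_address <- function(x) {
  x <- gsub("STREET", "ST", x, fixed = TRUE)
  x <- gsub("BOULEVARD", "BLVD", x, fixed = TRUE)
  x <- gsub("[[:punct:]]", "", x)   # drop commas, periods, '#', etc.
  x <- gsub("[[:space:]]", "", x)   # drop all whitespace
  x
}

# Illustrative upper-case inputs, as in the tax file
cleaned <- clean_address(c("389 WASHINGTON STREET", "100 EXAMPLE BOULEVARD, #2"))
print(cleaned)
```

Collapsing to a punctuation-free, space-free key makes exact-match joins more forgiving, at the cost of a key that is no longer human-friendly.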
Tax Assessment Data has Hidden Treasures
• From “Qual” we can parse out floor number and unit number
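# two Washington St buildings encode the floor one character later in Qual
taxdata$Floor <-
  ifelse(taxdata$AddressClean=="389WASHINGTONST" | taxdata$AddressClean=="174WASHINGTONST",
         substr(taxdata$Qual,3,4), substr(taxdata$Qual,2,3))
# penthouse units ("PH") get the building's top floor
# (assumes BuildingNumberOfStories has been joined onto taxdata)
taxdata$Floor <- ifelse(taxdata$Floor=="PH", taxdata$BuildingNumberOfStories, taxdata$Floor)
taxdata$Unit <-
  ifelse(taxdata$AddressClean=="389WASHINGTONST" | taxdata$AddressClean=="174WASHINGTONST",
         substr(taxdata$Qual,5,5), substr(taxdata$Qual,4,5))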
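To make the positional parsing concrete, here is a small sketch for the common case (floor in characters 2-3, unit in characters 4-5). The `parse_qual` helper, the sample qualifier strings, and the 30-story building are hypothetical; real MOD IV qualifiers may be laid out differently per building.

```r
# Hypothetical helper for the common Qual layout: <class char><floor 2 chars><unit 2 chars>
parse_qual <- function(qual, stories) {
  floor <- substr(qual, 2, 3)
  unit  <- substr(qual, 4, 5)
  # "PH" marks a penthouse; map it to the top floor of the building
  floor <- ifelse(floor == "PH", stories, floor)
  data.frame(Floor = as.integer(floor), Unit = unit, stringsAsFactors = FALSE)
}

# Made-up qualifiers for an assumed 30-story building
parsed <- parse_qual(c("C0704", "CPH01"), stories = 30)
print(parsed)
```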
Third Party Data Sources To Join
• Condo building attributes:
  - http://livingonthehudson.com
  - http://www.jcboe.org
  - http://www.zillow.com
  - http://www.streeteasy.com
  - http://buyersadvisors.com
• Location primeness:
  - http://walkscore.com
• Building geocode / transit location:
  - http://maps.google.com/maps/api/geocode/
• Census tract level demographics:
  - http://geomap.ffiec.gov
Map Out All the Buildings
# step 1.4.3 map the buildings out
library(ggplot2)   # geom_point(), ggsave()
library(ggmap)     # ggmap(), get_googlemap()
bldmaap <- ggmap(get_googlemap(center='Grove Street, Jersey City, NJ', zoom=14,
                               maptype='roadmap'), extent='device') +
  geom_point(data=bldgeoc, aes(x=lon, y=lat), colour='darkblue',
             alpha=0.7, na.rm=TRUE, size=5)
bldmaap
ggsave(filename="map.png", plot=last_plot(), width=3, height=3)
Building Attributes
For each condo building:
• Address / Lat / Lon
• Number of Units
• Number of Stories
• Year Built
• Walk Score
• Census Tract Median Household Income
• Distance to Water
• Distance to PATH (subway) Station
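The deck does not say how the two distance attributes were computed. One plausible approach, shown here purely as an assumption, is the great-circle (haversine) distance between a building's geocode and a reference point. `haversine_km` and the coordinates below are illustrative and not taken from the project data.

```r
# Great-circle distance in km between two lat/lon points (degrees in, km out)
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))   # pmin guards against rounding above 1
}

# Illustrative coordinates only (a building and a station a few blocks apart)
d <- haversine_km(40.7200, -74.0430, 40.7160, -74.0330)
print(d)
```

At neighborhood scale a straight-line distance understates walking distance, so a routing service would be the stricter choice if precision mattered.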
Building Data Table
Building Scoring System – PLSR is superior
# step 1.6 come up with a building primeness score using PCA/PLSR
library(pls)   # pcr(), plsr()
bld.pcr <- pcr(BuildingPPSF ~
                 OrderYearBuilt+OrderWalkScore+OrderMedianHouseholdIncome+OrderDPTH+OrderDWATER,
               1, data = blddata, validation = "CV")
bld.pls <- plsr(BuildingPPSF ~
                  OrderYearBuilt+OrderWalkScore+OrderMedianHouseholdIncome+OrderDPTH+OrderDWATER,
                1, data = blddata, validation = "CV")
# predict() returns an array; flatten it to one score per building
blddata$BuildingScore <- as.numeric(predict(bld.pls, newdata=blddata))

PCA TRAINING: % variance explained
              1 comps
X               35.29
BuildingPPSF    22.10

PLSR TRAINING: % variance explained
              1 comps
X               29.58
BuildingPPSF    62.48
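Why does one PLSR component explain more of BuildingPPSF than one PCA component? The PCA direction maximizes variance in the predictors and ignores the response, while the first PLS direction maximizes covariance with the response. The base-R sketch below illustrates this on synthetic data; the data, the coefficients, and all variable names are assumptions for illustration and do not use the `pls` package or the project's data.

```r
set.seed(1)
n <- 19                                        # one row per building, as in the deck
X <- matrix(rnorm(n * 5), n, 5)                # synthetic stand-ins for the five Order* ranks
y <- as.numeric(X %*% c(0.2, 0.1, 0.8, -0.5, 0.3) + rnorm(n))  # synthetic price per sq ft

Xc <- scale(X, center = TRUE, scale = FALSE)   # center only, no rescaling
yc <- y - mean(y)

# One principal component: direction of maximum variance in X (y is not used)
w_pca <- prcomp(Xc, center = FALSE)$rotation[, 1]
t_pca <- as.numeric(Xc %*% w_pca)

# One PLS component: direction of maximum covariance with y, proportional to X'y
w_pls <- as.numeric(crossprod(Xc, yc))
w_pls <- w_pls / sqrt(sum(w_pls^2))
t_pls <- as.numeric(Xc %*% w_pls)

# Share of y's variance explained by each single component
r2_pca <- cor(t_pca, yc)^2
r2_pls <- cor(t_pls, yc)^2
print(c(pca = r2_pca, pls = r2_pls))
```

With a single component, the PLS score can never explain less of the response than the PCA score on the same training data, which is consistent with the 62.48% versus 22.10% shown above.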
Some Cross Checking on Buildings
BuildingName             calc.unit.count   stated.unit.count
700 Grove                            226                 237
77 Hudson                            407                 420
Clermont Cove                         97                  NA
Crystal Point                        257                 269
Fulton's Landing                     106                 105
Gulls Cove                           301                 432
Liberty Terrace                      116                 118
Mandalay on the Hudson               250                 269
Montgomery Greene                    102                 113
Pier House                            99                 180
Portofino                            264                  NA
Shore Club North                     211                 220
Shore Club South                     214                 220
Sugar House                           48                  65
The A Condominiums                   238                 250
The James Monroe                     364                  NA
Trump Plaza                          391                 445
Waldo Lofts                           80                  82
Zephyr Lofts                          96                 102
Among the 16 buildings with a stated unit count:
3,527 stated units in total
3,142 units covered by the tax data
89% coverage
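The coverage figure can be reproduced directly from the two unit-count columns. The sketch below copies the counts from the table in row order; the vector names are chosen here for illustration.

```r
# Unit counts copied from the cross-check table, in row order
calc   <- c(226, 407, 97, 257, 106, 301, 116, 250, 102, 99,
            264, 211, 214, 48, 238, 364, 391, 80, 96)
stated <- c(237, 420, NA, 269, 105, 432, 118, 269, 113, 180,
            NA, 220, 220, 65, 250, NA, 445, 82, 102)

known    <- !is.na(stated)                       # buildings with a stated unit count
coverage <- sum(calc[known]) / sum(stated[known])

print(sum(known))              # number of buildings compared
print(round(100 * coverage))   # coverage in percent
```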
Condo Unit Attributes
For each condo unit:
• Square Footage
• Sale Price
• Sale Date
• Floor
• Unit Number
• Building Score
Model the date dimension – price index
# step 2.3 checking condo price per sf over time (price index)
library(plyr)   # ddply()
library(zoo)    # rollmean()
aggu <- ddply(.data=findata[!is.na(findata$Sale.YrQtr) &
                            findata$Sale.YrQtr>="1999 Q1" &
                            findata$Sale.YrQtr<="2014 Q4", ],
              .variables='Sale.YrQtr',
              summarize,
              calc.avg.ppsf=mean(SalePrice/SqFt, na.rm=TRUE))
# trailing 2-, 4- and 8-quarter averages, padded with leading NAs
aggu$calc.avg.ppsf.r2q <- append(rollmean(aggu$calc.avg.ppsf, 2), rep(NA,1), after=0)
aggu$calc.avg.ppsf.r4q <- append(rollmean(aggu$calc.avg.ppsf, 4), rep(NA,3), after=0)
aggu$calc.avg.ppsf.r8q <- append(rollmean(aggu$calc.avg.ppsf, 8), rep(NA,7), after=0)
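The padded rolling means above are trailing averages: each quarter's index value is the mean of that quarter and the k-1 quarters before it. A dependency-free sketch of the same idea, using base R's `stats::filter`, is shown below. `trailing_mean` and the quarterly prices are hypothetical.

```r
# Trailing k-period mean; the first k-1 values are NA because the window is incomplete
trailing_mean <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

ppsf <- c(300, 320, 310, 350, 370)   # hypothetical average price per sq ft by quarter
r2q  <- trailing_mean(ppsf, 2)
print(r2q)
```

A longer window gives a smoother index but reacts more slowly to turning points, which is presumably why the deck computes 2-, 4- and 8-quarter versions side by side.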
Last Missing Variable: The View from Units
Manually entered:
using public-domain floor plan data and listing data
and consulting broker friends
Three categories:
2 – Great View
1 – Some View
0 – Nothing Special
In the future, listing descriptions could be searched for keywords:
- “Manhattan View”
- “Bay View”
- “Corner”
- etc.
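A keyword search of this kind could be sketched as follows. The mapping from keywords to the three view categories is an assumption made here for illustration (the deck does not define one), and `classify_view` and the sample listings are hypothetical.

```r
# Hypothetical rule: named views score 2, a corner unit scores 1, everything else 0
classify_view <- function(desc) {
  desc  <- tolower(desc)
  great <- grepl("manhattan view|bay view", desc)
  some  <- grepl("corner", desc)
  ifelse(great, 2L, ifelse(some, 1L, 0L))
}

listings <- c("Corner unit with Manhattan View",
              "Sunny corner one-bedroom",
              "Renovated kitchen, in-unit laundry")
view <- classify_view(listings)
print(view)
```

Hand-entered scores would still be needed to validate such rules, since listing text tends to overstate views.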
Box Plot: Does View Matter?
Model the Price!
• Simple linear regression – one variable at a time
• Multi-linear regression
• Model Tree (Weka)
• Generalized Boosted Regression Models (gbm)
• Random Forest
• Cross-validation of all the models
Simple linear regression
# step 3.1 bi-variate linear model
findata <- read.csv("findata.csv")
modelLMSqFt <- lm(PPSF~SqFt, data=findata)
summary(modelLMSqFt) #adjR2=0.0010, p<.05 BUT NEGLIGIBLE FIT
modelLMFloor <- lm(PPSF~Floor, data=findata)
summary(modelLMFloor) #adjR2=0.1709, p<.05 GOOD
modelLMBuildingScore <- lm(PPSF~BuildingScore, data=findata)
summary(modelLMBuildingScore) #adjR2=0.3584, p<.05 GOOD
modelLMView <- lm(PPSF~View, data=findata)
summary(modelLMView) #adjR2=0.0959, p<.05 GOOD
modelLMIndex <- lm(PPSF~Index, data=findata)
summary(modelLMIndex) #adjR2=0.1327, p<.05 GOOD
Multi-linear regression
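# step 3.2 multi-variate linear model
library(caret)   # RMSE() is assumed to be caret's, which the data partition step also uses
modelLM <- lm(PPSF~SqFt+Floor+View+Index+BuildingScore, data=findata)
summary(modelLM) #adjR2=0.4602
PPSFHatLM <- predict(modelLM,findata)
RMSE(PPSFHatLM, findata$PPSF, na.rm=TRUE) #99.17106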
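Every model in the deck is scored by RMSE, the square root of the mean squared difference between predicted and observed price per square foot. The sketch below assumes the usual definition; `rmse` is a stand-in written here, not the function the deck calls, and the numbers are made up.

```r
# Root mean squared error; na.rm drops pairs where either value is missing
rmse <- function(pred, obs, na.rm = FALSE) {
  sqrt(mean((pred - obs)^2, na.rm = na.rm))
}

pred <- c(500, 620, NA, 710)   # made-up predicted $/sq ft
obs  <- c(520, 600, 650, 700)  # made-up observed $/sq ft
err  <- rmse(pred, obs, na.rm = TRUE)
print(err)
```

Because RMSE is in the same unit as the target, an RMSE near 99 reads as a typical miss of roughly $99 per square foot.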
Model Tree
# step 3.3 model tree
library(RWeka)   # M5P()
modelMT <- M5P(PPSF~SqFt+Floor+View+Index+BuildingScore,data=findata)
summary(modelMT)
findata$PPSFHatMT <- predict(modelMT,findata)
RMSE(findata$PPSFHatMT, findata$PPSF, na.rm=TRUE) #86.63803
gbm
# step 3.4 gbm
library(gbm)   # gbm()
findata <- read.csv("findata.csv")
modelGBM <- gbm(PPSF~SqFt+Floor+View+Index+BuildingScore,
                data=findata, distribution="gaussian", n.trees=10000)
summary(modelGBM)
findata$PPSFHatGBM <- predict(modelGBM, newdata=findata, n.trees=10000)
RMSE(findata$PPSFHatGBM, findata$PPSF) #90.60208
Random Forest
# step 3.5 random forest
library(randomForest)
findata <- read.csv("findata.csv")
findata <- findata[!is.na(findata$Index),]
modelRF <- randomForest(PPSF~SqFt+Floor+View+Index+BuildingScore,data=findata)
print(modelRF)   # summary() only lists the object's components
# in-sample check: predicting the training rows flatters the forest
findata$PPSFHatRF <- predict(modelRF,newdata=findata)
RMSE(findata$PPSFHatRF, findata$PPSF) #64.49773
Cross Validation of All Models
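# step 4.1 data partition
library(caret)   # createDataPartition(), RMSE()
in_train <- createDataPartition(findata$PPSF, p=0.75, list=FALSE)
findata_train <- findata[in_train,]
findata_test <- findata[-in_train,]
# one round of 10-fold CV; `fit` is the model function under test
# (lm, M5P, randomForest; gbm also needs its n.trees arguments)
# k is the repeat index supplied by sapply() and is otherwise unused
rmse_cv <- function(k, train, fit){
  m <- nrow(train)
  num <- sample(1:10, m, replace=TRUE)   # random fold label per row
  rmse <- numeric(10)
  for (i in 1:10) {
    data.t <- train[num!=i, ]   # training folds
    data.v <- train[num==i, ]   # validation fold
    model <- fit(PPSF~SqFt+Floor+View+Index+BuildingScore, data=data.t)
    pred <- predict(model, newdata=data.v)
    rmse[i] <- RMSE(pred, data.v$PPSF)
  }
  return(mean(rmse))
}
# 100 repeats, shown here for the linear model
rmse <- sapply(1:100, rmse_cv, train=findata_train, fit=lm)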
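For readers who want to try the procedure without the project data, here is a self-contained base-R sketch of k-fold cross-validation on synthetic data. `kfold_rmse`, the toy data frame, and its coefficients are assumptions for illustration. Unlike the code above, it assigns rows to balanced folds rather than sampling fold labels with replacement.

```r
set.seed(42)
n   <- 200
toy <- data.frame(SqFt  = runif(n, 500, 2000),
                  Floor = sample(1:40, n, replace = TRUE))
toy$PPSF <- 400 + 3 * toy$Floor + rnorm(n, sd = 50)   # synthetic target

# k-fold CV: fit on k-1 folds, score RMSE on the held-out fold, average over folds
kfold_rmse <- function(data, formula, fit = lm, k = 10) {
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))   # balanced random folds
  errs <- numeric(k)
  for (i in seq_len(k)) {
    model   <- fit(formula, data = data[fold != i, ])
    pred    <- predict(model, newdata = data[fold == i, ])
    obs     <- data[fold == i, all.vars(formula)[1]]          # response column
    errs[i] <- sqrt(mean((pred - obs)^2))
  }
  mean(errs)
}

cv_rmse <- kfold_rmse(toy, PPSF ~ SqFt + Floor)
print(cv_rmse)
```

Passing the model function as an argument lets the same loop score any learner that follows the `fit(formula, data)` and `predict(model, newdata)` convention.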
Cross Validation Result
RMSE                     Total Universe   Cross Validation
Multi-linear                      99.17              99.55
Model Tree (Weka M5P)             86.63              78.46
GBM                               90.60              91.67
RandomForest                      64.49              82.19
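Ranking the models on each column shows why the two RMSE figures should not be confused. The sketch below copies the values from the table above; the data frame and column names are chosen here for illustration.

```r
# RMSE values copied from the result table
results <- data.frame(
  model = c("Multi-linear", "Model Tree (Weka M5P)", "GBM", "RandomForest"),
  total = c(99.17, 86.63, 90.60, 64.49),   # fit and scored on the same data
  cv    = c(99.55, 78.46, 91.67, 82.19),   # scored on held-out folds
  stringsAsFactors = FALSE
)

best_total <- results$model[which.min(results$total)]
best_cv    <- results$model[which.min(results$cv)]
results$gap <- results$cv - results$total   # large positive gap suggests overfitting

print(best_total)
print(best_cv)
print(results)
```

Random forest wins on the full data set but shows the largest gap between the two columns, while the model tree has the lowest cross-validation RMSE.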
Wish List Items…
• More rigorous regression diagnostics
• Better model tuning
• Model blending
• Compare with Zestimate