THE HACK ON JERSEY CITY CONDO PRICES
explore trends in public data
Yiqun “Yi” Wang / NYC Data Science Academy / Code for JC / March 2015
Outline of the Project
• Data
• Tax assessment data
• Third party data sources to join
• Data cleaning and collection work
• Relationship Exploration
• Building attributes exploration
• Individual units price exploration
• Model for Prices
• 5 competing models
• Cross validation
Tax Assessment Data
NJ MOD IV System
Covers all individual properties
Downloadable in batch in text files
Key columns:
- Property address
- Property class
- Property size
- Year built
- Owner address
- Owner name
- Last sold price
- Last sold date
- Qualifier (can parse out condo floor # and unit #)
Tax Assessment Data
• 62,270 total property records, as of Feb 2015
-- filter down to --
• 3,867 individual condo units (of 19 selected mid/high-rise buildings)
# step 1.1 tax data load from NJ MOD IV system
url <- "http://tax1.co.monmouth.nj.us/download/0906monm204610.zip"
download.file(url,"0906monm204610.zip",quiet = FALSE)
closeAllConnections()
unzip("0906monm204610.zip")
taxdata <- read.csv(file="0906monm204610.csv")
Tax Assessment Data is Dirty!
# address cleanup
taxdata$Property.Location <- gsub("STREET", "ST", taxdata$Property.Location)
taxdata$Property.Location <- gsub("BOULEVARD", "BLVD", taxdata$Property.Location)
taxdata$Property.Location <- gsub("[[:punct:]]", "", taxdata$Property.Location)
taxdata$Property.Location <- gsub("[[:space:]]", "", taxdata$Property.Location)
# drop bad prices, bad units, bad sf, bad records, retail condos
# is.na() checks each row for a missing value
taxdata <- taxdata[taxdata$Sale.Price>10000,]
taxdata <- taxdata[taxdata$Sale.Price<=10000000,]
taxdata <- taxdata[!(taxdata$Qual=="" | is.na(taxdata$Qual)),]
taxdata <- taxdata[!(taxdata$Sq..Ft.=="" | is.na(taxdata$Sq..Ft.) |
  taxdata$Sq..Ft.<=400 | taxdata$Sq..Ft.>=3000),]
taxdata <- taxdata[!(is.na(taxdata$Map.Page)),]
taxdata <- taxdata[!(taxdata$Building.Class=="C"),]
taxdata <- taxdata[!(substr(taxdata$Qual,4,4) %in% c("R","L","U")),]
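The address cleanup can be tried on a few strings before running it on the full table. A minimal sketch: normalize_address is a hypothetical helper wrapping the same four substitutions, and the sample addresses are made up for illustration.

```r
# Hypothetical helper: the same four gsub() steps as the address cleanup.
normalize_address <- function(x) {
  x <- gsub("STREET", "ST", x)         # abbreviate street
  x <- gsub("BOULEVARD", "BLVD", x)    # abbreviate boulevard
  x <- gsub("[[:punct:]]", "", x)      # strip punctuation
  gsub("[[:space:]]", "", x)           # strip all whitespace
}

# Made-up sample addresses, already upper case like the tax file.
addr <- c("389 WASHINGTON STREET", "J.F. KENNEDY BOULEVARD", "20 2ND ST.")
clean <- normalize_address(addr)
print(clean)
```

The collapsed strings make a convenient join key across data sources, at the cost of no longer being human-friendly.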
Tax Assessment Data has Hidden Treasures
• From “Qual” we can parse out floor number and unit number
# the two Washington St buildings use a different Qual layout
shifted <- taxdata$AddressClean %in% c("389WASHINGTONST","174WASHINGTONST")
taxdata$Floor <- ifelse(shifted,
  substr(taxdata$Qual,3,4),substr(taxdata$Qual,2,3))
# penthouse ("PH") maps to the top floor; assumes BuildingNumberOfStories is a column of taxdata
taxdata$Floor <- ifelse(taxdata$Floor=="PH",taxdata$BuildingNumberOfStories,taxdata$Floor)
taxdata$Unit <- ifelse(shifted,
  substr(taxdata$Qual,5,5),substr(taxdata$Qual,4,5))
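A worked example of the floor/unit parsing, as a sketch only. The qualifier codes and the 30-story height are made up; the real MOD IV codes may be laid out differently. It follows the default layout (floor in characters 2-3, unit in characters 4-5).

```r
# Made-up qualifier codes for illustration; not taken from the tax file.
qual    <- c("C0712", "C2304", "CPH01")
stories <- c(30, 30, 30)   # hypothetical number of stories for each row

floor_chr <- substr(qual, 2, 3)   # characters 2-3 hold the floor
unit      <- substr(qual, 4, 5)   # characters 4-5 hold the unit

# Penthouse rows take the building height; other rows convert to numbers.
floor_num <- stories
is_numbered <- floor_chr != "PH"
floor_num[is_numbered] <- as.numeric(floor_chr[is_numbered])

print(data.frame(qual, floor_num, unit))
```

Converting only the non-penthouse rows avoids the coercion warning that as.numeric("PH") would raise.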
Third Party Data Sources To Join
• Condo building attributes:
• http://livingonthehudson.com
• http://www.jcboe.org
• http://www.zillow.com
• http://www.streeteasy.com
• http://buyersadvisors.com
• Location primeness:
• http://walkscore.com
• Building Geocode / Transit location:
• http://maps.google.com/maps/api/geocode/
• Census tract level demographics:
• http://geomap.ffiec.gov
Map Out All the Buildings
# step 1.4.3 map the buildings out
library(ggmap)  # provides ggmap() and get_googlemap(); attaches ggplot2
bldmaap <- ggmap(get_googlemap(
    center='Grove Street, Jersey City, NJ', zoom=14,
    maptype='roadmap'), extent='device') +
  geom_point(data=bldgeoc, aes(x=lon, y=lat), colour='darkblue',
    alpha=0.7, na.rm=TRUE, size=5)
bldmaap
ggsave(filename="map.png", plot=bldmaap, width=3, height=3)
Building Attributes
For each condo building:
• Address / Lat / Lon
• Number of Units
• Number of Stories
• Year Built
• Walk Score
• Census Tract Median Household Income
• Distance to Water
• Distance to PATH (subway) Station
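The slides do not say how the two distance attributes were computed. One common option, given the Lat / Lon of each building, is the great-circle (haversine) distance. A minimal sketch: haversine_km is a hypothetical helper, and the coordinates are rough illustrative values, not taken from the project data.

```r
# Hypothetical helper: great-circle distance in km between two lat/lon points.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))   # pmin guards against rounding above 1
}

# Illustrative coordinates only: a building and a nearby station.
d_station <- haversine_km(40.7170, -74.0340, 40.7196, -74.0431)
print(d_station)
```

The function is vectorized, so it can be applied to a whole column of building coordinates at once.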
Building Data Table
Building Scoring System – PLSR is superior
# step 1.6 come up with a building primeness score using PCR/PLSR
library(pls)  # provides pcr() and plsr()
bld.pcr <- pcr(BuildingPPSF ~
  OrderYearBuilt+OrderWalkScore+OrderMedianHouseholdIncome+OrderDPTH+OrderDWATER,
  ncomp = 1, data = blddata, validation = "CV")
bld.pls <- plsr(BuildingPPSF ~
  OrderYearBuilt+OrderWalkScore+OrderMedianHouseholdIncome+OrderDPTH+OrderDWATER,
  ncomp = 1, data = blddata, validation = "CV")
# predict() returns an array; drop() flattens it to one score per building
blddata$BuildingScore <- drop(predict(bld.pls, newdata=blddata))

TRAINING: % variance explained (1 component)
                 PCR    PLSR
X              35.29   29.58
BuildingPPSF   22.10   62.48
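Why a single PLSR component can explain more of the price than a single principal component: the principal component looks only at the attributes, while the PLS direction is chosen using the response as well. A base-R sketch on synthetic data (not the building table), with centered and scaled predictors as an assumption:

```r
set.seed(1)
n <- 19                                   # same row count as the building list
X <- scale(matrix(rnorm(n * 5), n, 5))    # synthetic, standardized attributes
y <- as.numeric(X %*% c(0.2, 1, 0.5, -0.8, 0.1) + rnorm(n, sd = 0.5))
yc <- y - mean(y)                         # centered synthetic response

# PCR with one component: direction of largest variance in X, ignoring y.
pc1 <- prcomp(X)$x[, 1]
r2_pcr <- cor(yc, pc1)^2

# PLS with one component: weights proportional to the covariance of X with y.
w <- crossprod(X, yc)
w <- w / sqrt(sum(w^2))
t1 <- as.numeric(X %*% w)
r2_pls <- cor(yc, t1)^2

print(c(PCR = r2_pcr, PLSR = r2_pls))
```

The flip side, visible in the table above, is that the PLS component explains less of the variance in X itself.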
Some Cross Checking on Buildings
BuildingName              calc.unit.count   stated.unit.count
700 Grove                             226                 237
77 Hudson                             407                 420
Clermont Cove                          97                  NA
Crystal Point                         257                 269
Fulton's Landing                      106                 105
Gulls Cove                            301                 432
Liberty Terrace                       116                 118
Mandalay on the Hudson                250                 269
Montgomery Greene                     102                 113
Pier House                             99                 180
Portofino                             264                  NA
Shore Club North                      211                 220
Shore Club South                      214                 220
Sugar House                            48                  65
The A Condominiums                    238                 250
The James Monroe                      364                  NA
Trump Plaza                           391                 445
Waldo Lofts                            80                  82
Zephyr Lofts                           96                 102

Among the 16 buildings with a known unit count:
  3,527 total units
  3,142 units covered
  89% coverage
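The coverage figures can be reproduced from the table with a few lines of R. The numbers below are copied from the table; the variable names are only for this sketch.

```r
# Unit counts copied from the cross-check table (NA = stated count unknown).
calc   <- c(226, 407, 97, 257, 106, 301, 116, 250, 102, 99,
            264, 211, 214, 48, 238, 364, 391, 80, 96)
stated <- c(237, 420, NA, 269, 105, 432, 118, 269, 113, 180,
            NA, 220, 220, 65, 250, NA, 445, 82, 102)

known <- !is.na(stated)                 # buildings with a stated unit count
total_units   <- sum(stated[known])
covered_units <- sum(calc[known])
coverage_pct  <- round(100 * covered_units / total_units)

print(c(buildings = sum(known), total = total_units,
        covered = covered_units, pct = coverage_pct))
```

Summing calc over all 19 buildings gives the 3,867 condo units used in the rest of the analysis.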
Condo Unit Attributes
For each condo unit:
• Square Footage
• Sale Price
• Sale Date
• Floor
• Unit Number
• Building Score
Model the date dimension – price index
# step 2.3 checking condo price per sf over time (price index)
library(plyr)  # ddply()
library(zoo)   # rollmeanr()
aggu <- ddply(.data=findata[findata$Sale.YrQtr>="1999 Q1" &
    findata$Sale.YrQtr<="2014 Q4" & !is.na(findata$Sale.YrQtr),],
  .variables='Sale.YrQtr',
  summarize,
  calc.avg.ppsf=mean(SalePrice/SqFt,na.rm=TRUE)
)
# trailing 2-, 4- and 8-quarter rolling means, NA-padded at the start
aggu$calc.avg.ppsf.r2q <- rollmeanr(aggu$calc.avg.ppsf, 2, fill=NA)
aggu$calc.avg.ppsf.r4q <- rollmeanr(aggu$calc.avg.ppsf, 4, fill=NA)
aggu$calc.avg.ppsf.r8q <- rollmeanr(aggu$calc.avg.ppsf, 8, fill=NA)
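What the rolling columns contain, shown on a tiny made-up series. This sketch uses base R's stats::filter instead of zoo, and trailing_mean is a hypothetical helper name; it should give the same trailing average with NA padding at the start.

```r
# Hypothetical helper: trailing k-period mean, NA until k values are available.
trailing_mean <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# Made-up quarterly average price per square foot.
ppsf <- c(100, 110, 120, 130)
r2q <- trailing_mean(ppsf, 2)
r4q <- trailing_mean(ppsf, 4)
print(rbind(r2q, r4q))
```

A longer window gives a smoother index but loses more quarters at the start of the series.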
Last Missing Variable: The View from Units
Manually entered:
using public-domain floor plan data and listing data
and consulting broker friends
Three categories:
2 – Great View
1 – Some View
0 – Nothing Special
In the future, can look for text description in listing:
- “Manhattan View”
- “Bay View”
- “Corner”
- etc.
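A rough sketch of how the listing-text idea could look. The listing descriptions are made up, and the mapping from keywords to the three view categories is an assumption for illustration, not something the project defined.

```r
# Made-up listing descriptions.
listing <- c("Corner unit with Manhattan view",
             "Bright one bedroom",
             "Bay View from the living room")

# Assumed mapping: "Manhattan view" -> 2; "bay view" or "corner" -> 1; else 0.
great <- grepl("manhattan view", listing, ignore.case = TRUE)
some  <- grepl("bay view|corner", listing, ignore.case = TRUE)
view  <- ifelse(great, 2, ifelse(some, 1, 0))

print(data.frame(listing, view))
```

Keyword scores like these would still need spot checks against the manually entered values.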
Box Plot: Does View Matter?
Model the Price!
• Simple linear regression – one variable at a time
• Multi-linear regression
• Model Tree (Weka)
• Generalized Boosted Regression Models (gbm)
• Random Forest
• Cross-validate all the models
Simple linear regression
# step 3.1 bi-variate linear model
findata <- read.csv("findata.csv")
modelLMSqFt <- lm(PPSF~SqFt, data=findata)
summary(modelLMSqFt) #adjR2=0.0010, p<.05 OKAY, but negligible fit
modelLMFloor <- lm(PPSF~Floor, data=findata)
summary(modelLMFloor) #adjR2=0.1709, p<.05 GOOD
modelLMBuildingScore <- lm(PPSF~BuildingScore, data=findata)
summary(modelLMBuildingScore) #adjR2=0.3584, p<.05 GOOD
modelLMView <- lm(PPSF~View, data=findata)
summary(modelLMView) #adjR2=0.0959, p<.05 GOOD
modelLMIndex <- lm(PPSF~Index, data=findata)
summary(modelLMIndex) #adjR2=0.1327, p<.05 GOOD
Multi-linear regression
# step 3.2 multi-variate linear model
modelLM <- lm(PPSF~SqFt+Floor+View+Index+BuildingScore, data=findata)
summary(modelLM) #adjR2=0.4602
findata$PPSFHatLM <- predict(modelLM,findata)
# RMSE() comes from the caret package
RMSE(findata$PPSFHatLM, findata$PPSF, na.rm=TRUE) #99.17106
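RMSE is the yardstick for every model that follows: the square root of the mean squared difference between predicted and actual price per square foot. A base-R sketch with made-up numbers; rmse_base is a hypothetical stand-in for the packaged RMSE() function.

```r
# Hypothetical stand-in for RMSE(): root mean squared error.
rmse_base <- function(pred, obs) {
  sqrt(mean((pred - obs)^2, na.rm = TRUE))
}

# Made-up predicted and actual price per square foot.
pred <- c(500, 600, 700)
obs  <- c(510, 590, 720)

# Errors are -10, 10, -20; squared 100, 100, 400; mean 200.
err <- rmse_base(pred, obs)
print(err)
```

Because the errors are squared before averaging, a few badly mispriced units weigh heavily on the score.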
Model Tree
# step 3.3 model tree
modelMT <- M5P(PPSF~SqFt+Floor+View+Index+BuildingScore,data=findata)
summary(modelMT)
findata$PPSFHatMT <- predict(modelMT,findata)
RMSE(findata$PPSFHatMT, findata$PPSF, na.rm=TRUE) #86.63803
gbm
# step 3.4 gbm
findata <- read.csv("findata.csv")
modelGBM <- gbm(PPSF~SqFt+Floor+View+Index+BuildingScore,
data=findata,distribution="gaussian",n.trees=10000)
summary(modelGBM)
findata$PPSFHatGBM <- predict(modelGBM,newdata=findata,n.trees=10000)
RMSE(findata$PPSFHatGBM, findata$PPSF) #90.60208
Random Forest
# step 3.5 random forest
findata <- read.csv("findata.csv")
findata <- findata[!is.na(findata$Index),]
modelRF <- randomForest(PPSF~SqFt+Floor+View+Index+BuildingScore,data=findata)
summary(modelRF)
findata$PPSFHatRF <- predict(modelRF,newdata=findata)
RMSE(findata$PPSFHatRF, findata$PPSF) #64.49773
Cross Validation of All Models
# step 4.1 data partition
library(caret)  # createDataPartition(), RMSE()
in_train <- createDataPartition(findata$PPSF, p=0.75, list=FALSE)
findata_train <- findata[in_train,]
findata_test <- findata[-in_train,]
# one run of 10-fold CV; k only labels the repetition
# fit is the model function under test (lm, M5P, randomForest, ...)
rmse_cv <- function(k,train,fit=lm){
  m <- nrow(train)
  num <- sample(1:10,m,replace=TRUE)
  rmse <- numeric(10)
  for (i in 1:10) {
    data.t <- train[num!=i, ]
    data.v <- train[num==i, ]
    model <- fit(PPSF~SqFt+Floor+View+Index+BuildingScore,data=data.t)
    pred <- predict(model,newdata=data.v)
    rmse[i] <- RMSE(pred,data.v$PPSF)
  }
  return(mean(rmse))
}
# 100 repetitions of 10-fold CV, one mean RMSE each
rmse <- sapply(1:100,rmse_cv,findata_train)
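A self-contained version of the same idea that runs without findata or any packages. The toy data frame is synthetic and cv_rmse is a hypothetical helper; unlike the sampling with replacement above, this sketch assigns folds by shuffling a repeated 1..10 vector, so every fold has nearly the same size.

```r
set.seed(42)
n <- 200
# Synthetic stand-in for findata; the coefficients are arbitrary.
toy <- data.frame(SqFt  = runif(n, 400, 3000),
                  Floor = sample(1:40, n, replace = TRUE))
toy$PPSF <- 400 + 5 * toy$Floor + rnorm(n, sd = 50)

# Hypothetical helper: mean RMSE over balanced k folds for a linear model.
cv_rmse <- function(data, folds = 10) {
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(data)))
  errs <- numeric(folds)
  for (i in seq_len(folds)) {
    fit  <- lm(PPSF ~ SqFt + Floor, data = data[fold_id != i, ])
    held <- data[fold_id == i, ]
    pred <- predict(fit, newdata = held)
    errs[i] <- sqrt(mean((pred - held$PPSF)^2))
  }
  mean(errs)
}

result <- cv_rmse(toy)
print(result)
```

Balanced folds avoid the occasional very small validation set that random assignment can produce.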
Cross Validation Result
RMSE                     Total Universe   Cross Validation
Multi-linear                      99.17              99.55
Model Tree (Weka M5P)             86.63              78.46
GBM                               90.60              91.67
RandomForest                      64.49              82.19
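One way to read the table: the difference between the two columns is a rough signal of how much a model benefits from having already seen the data it is scored on. A short sketch using the figures from the table; the column names are only for this example.

```r
# RMSE figures copied from the result table.
cv_table <- data.frame(
  model = c("Multi-linear", "Model Tree (Weka M5P)", "GBM", "RandomForest"),
  total = c(99.17, 86.63, 90.60, 64.49),
  cv    = c(99.55, 78.46, 91.67, 82.19),
  stringsAsFactors = FALSE
)

# Positive gap: the model looks better in-sample than under cross validation.
cv_table$gap <- cv_table$cv - cv_table$total
best_cv     <- cv_table$model[which.min(cv_table$cv)]
largest_gap <- cv_table$model[which.max(cv_table$gap)]

print(cv_table)
```

On these figures the model tree has the lowest cross-validated RMSE, while the random forest shows the widest gap between the two columns.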
Wish List Items…
• More rigorous regression diagnostics
• Tuning models better
• Model blending
• Compare with Zestimate
THANK YOU!
yiqun.wang@nyu.edu

Weitere ähnliche Inhalte

Andere mochten auch

Spatial query tutorial for nyc subway income level along subway
Spatial query tutorial  for nyc subway income level along subwaySpatial query tutorial  for nyc subway income level along subway
Spatial query tutorial for nyc subway income level along subwayVivian S. Zhang
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedVivian S. Zhang
 
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Vivian S. Zhang
 
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycVivian S. Zhang
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret packageVivian S. Zhang
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycVivian S. Zhang
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rVivian S. Zhang
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningVivian S. Zhang
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangVivian S. Zhang
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
 
Lean Startup Metrics & Analytics
Lean Startup Metrics & AnalyticsLean Startup Metrics & Analytics
Lean Startup Metrics & AnalyticsNicola Junior Vitto
 

Andere mochten auch (17)

Spatial query tutorial for nyc subway income level along subway
Spatial query tutorial  for nyc subway income level along subwaySpatial query tutorial  for nyc subway income level along subway
Spatial query tutorial for nyc subway income level along subway
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
 
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
 
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
 
Xgboost
XgboostXgboost
Xgboost
 
From Lag to Lead: Actionable Analytics
From Lag to Lead: Actionable AnalyticsFrom Lag to Lead: Actionable Analytics
From Lag to Lead: Actionable Analytics
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
 
Xgboost
XgboostXgboost
Xgboost
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
Lean Startup Metrics & Analytics
Lean Startup Metrics & AnalyticsLean Startup Metrics & Analytics
Lean Startup Metrics & Analytics
 

Ähnlich wie THE KEY FACTORS THAT IMPACT JERSEY CITY CONDO PRICES

Data pre-processing and Exploration on 2016 Melbourne housing market by using R
Data pre-processing and Exploration on 2016 Melbourne housing market by using RData pre-processing and Exploration on 2016 Melbourne housing market by using R
Data pre-processing and Exploration on 2016 Melbourne housing market by using RShuaiGao3
 
Work in TDW
Work in TDWWork in TDW
Work in TDWsaso70
 
Cis 5200presentation groupb
Cis 5200presentation groupbCis 5200presentation groupb
Cis 5200presentation groupbNarendra Mali
 
GOBUILK Manila by SCG : Digital Construction Community for the Philippines
GOBUILK Manila by SCG : Digital Construction Community for the PhilippinesGOBUILK Manila by SCG : Digital Construction Community for the Philippines
GOBUILK Manila by SCG : Digital Construction Community for the PhilippinesBuilk Thailand
 
Quill - 一個 Scala 的資料庫存取利器
Quill - 一個 Scala 的資料庫存取利器Quill - 一個 Scala 的資料庫存取利器
Quill - 一個 Scala 的資料庫存取利器vito jeng
 
Introduction to R
Introduction to RIntroduction to R
Introduction to RStacy Irwin
 
Blockchain technology-in-fin tech - Anton Sitnikov
Blockchain technology-in-fin tech - Anton SitnikovBlockchain technology-in-fin tech - Anton Sitnikov
Blockchain technology-in-fin tech - Anton SitnikovDataFest Tbilisi
 
Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)Ed Thewlis
 
N1QL: What's new in Couchbase 5.0
N1QL: What's new in Couchbase 5.0N1QL: What's new in Couchbase 5.0
N1QL: What's new in Couchbase 5.0Keshav Murthy
 
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5Keshav Murthy
 
The Process and Toolkit of Layer 2 Mechanism Design
The Process and Toolkit of Layer 2 Mechanism DesignThe Process and Toolkit of Layer 2 Mechanism Design
The Process and Toolkit of Layer 2 Mechanism DesignBrandon Ramirez
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Why and how to leverage the simplicity and power of SQL on Flink
Why and how to leverage the simplicity and power of SQL on FlinkWhy and how to leverage the simplicity and power of SQL on Flink
Why and how to leverage the simplicity and power of SQL on FlinkDataWorks Summit
 
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Big Data Spain
 
On Relevant Query Answering over Streaming and Distributed Data
On Relevant Query Answering over Streaming and Distributed DataOn Relevant Query Answering over Streaming and Distributed Data
On Relevant Query Answering over Streaming and Distributed DataShima Zahmatkesh
 
Logistic Regression, Linear and Quadratic Discriminant Analysis and K-Nearest...
Logistic Regression, Linear and Quadratic Discriminant Analysis and K-Nearest...Logistic Regression, Linear and Quadratic Discriminant Analysis and K-Nearest...
Logistic Regression, Linear and Quadratic Discriminant Analysis and K-Nearest...Tarek Dib
 

Ähnlich wie THE KEY FACTORS THAT IMPACT JERSEY CITY CONDO PRICES (20)

Data pre-processing and Exploration on 2016 Melbourne housing market by using R
Data pre-processing and Exploration on 2016 Melbourne housing market by using RData pre-processing and Exploration on 2016 Melbourne housing market by using R
Data pre-processing and Exploration on 2016 Melbourne housing market by using R
 
PythonCERR_2014
PythonCERR_2014PythonCERR_2014
PythonCERR_2014
 
Work in TDW
Work in TDWWork in TDW
Work in TDW
 
Cis 5200presentation groupb
Cis 5200presentation groupbCis 5200presentation groupb
Cis 5200presentation groupb
 
Final
FinalFinal
Final
 
GOBUILK Manila by SCG : Digital Construction Community for the Philippines
GOBUILK Manila by SCG : Digital Construction Community for the PhilippinesGOBUILK Manila by SCG : Digital Construction Community for the Philippines
GOBUILK Manila by SCG : Digital Construction Community for the Philippines
 
Quill - 一個 Scala 的資料庫存取利器
Quill - 一個 Scala 的資料庫存取利器Quill - 一個 Scala 的資料庫存取利器
Quill - 一個 Scala 的資料庫存取利器
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Blockchain technology-in-fin tech - Anton Sitnikov
Blockchain technology-in-fin tech - Anton SitnikovBlockchain technology-in-fin tech - Anton Sitnikov
Blockchain technology-in-fin tech - Anton Sitnikov
 
Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)
 
N1QL: What's new in Couchbase 5.0
N1QL: What's new in Couchbase 5.0N1QL: What's new in Couchbase 5.0
N1QL: What's new in Couchbase 5.0
 
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
 
The Process and Toolkit of Layer 2 Mechanism Design
The Process and Toolkit of Layer 2 Mechanism DesignThe Process and Toolkit of Layer 2 Mechanism Design
The Process and Toolkit of Layer 2 Mechanism Design
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Why and how to leverage the simplicity and power of SQL on Flink
Why and how to leverage the simplicity and power of SQL on FlinkWhy and how to leverage the simplicity and power of SQL on Flink
Why and how to leverage the simplicity and power of SQL on Flink
 
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
 
CompCruncher CVR Sample Valuation
CompCruncher CVR Sample Valuation CompCruncher CVR Sample Valuation
CompCruncher CVR Sample Valuation
 
On Relevant Query Answering over Streaming and Distributed Data
On Relevant Query Answering over Streaming and Distributed DataOn Relevant Query Answering over Streaming and Distributed Data
On Relevant Query Answering over Streaming and Distributed Data
 
Logistic Regression, Linear and Quadratic Discriminant Analysis and K-Nearest...
Logistic Regression, Linear and Quadratic Discriminant Analysis and K-Nearest...Logistic Regression, Linear and Quadratic Discriminant Analysis and K-Nearest...
Logistic Regression, Linear and Quadratic Discriminant Analysis and K-Nearest...
 
SQL
SQLSQL
SQL
 

Mehr von Vivian S. Zhang

Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger RenVivian S. Zhang
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide bookVivian S. Zhang
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentationVivian S. Zhang
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataVivian S. Zhang
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data Vivian S. Zhang
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Vivian S. Zhang
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Vivian S. Zhang
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Vivian S. Zhang
 
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...Vivian S. Zhang
 
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...Vivian S. Zhang
 
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...Vivian S. Zhang
 

Mehr von Vivian S. Zhang (12)

Why NYC DSA.pdf
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdf
 
Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide book
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentation
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
 
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
 
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
 
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
 

THE KEY FACTORS THAT IMPACT JERSEY CITY CONDO PRICES

  • 1. THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data Yiqun “Yi” Wang / NYC Data Science Academy / Code for JC / March 2015
  • 2. THE HACK ON JERSEY CITY CONDO PRICES Outline of the Project • Data • Tax assessment data • Third party data sources to join • Data janitor and collection works • Relationship Exploration • Building attributes exploration • Individual units price exploration • Model for Prices • 5 competing models • Cross validation
  • 3. THE HACK ON JERSEY CITY CONDO PRICES Tax Assessment Data NJ MOD IV System Covers all individual properties Downloadable in batch in text files Key columns: - Property address - Property class - Property size - Year built - Owner address - Owner name - Last sold price - Last sold date - Qualifier (can parse out condo floor # and unit #)
  • 4. THE HACK ON JERSEY CITY CONDO PRICES Tax Assessment Data • 62,270 totals property records, as of Feb 2015 -- filter down to -- • 3,867 individual condo units (of 19 selected mid/high-rise buildings) # step 1.1 tax data load from NJ MOD IV system url <- "http://tax1.co.monmouth.nj.us/download/0906monm204610.zip" download.file(url,"0906monm204610.zip",quiet = FALSE) closeAllConnections() unzip("0906monm204610.zip") taxdata <- read.csv(file="0906monm204610.csv")
  • 5. THE HACK ON JERSEY CITY CONDO PRICES Tax Assessment Data is Dirty! # address cleanup taxdata$Property.Location <- gsub("STREET", "ST", taxdata$Property.Location) taxdata$Property.Location <- gsub("BOULEVARD", "BLVD", taxdata$Property.Location) taxdata$Property.Location <- gsub("[[:punct:]]", "", taxdata$Property.Location) taxdata$Property.Location <- gsub("[[:space:]]", "", taxdata$Property.Location) # drop bad prices, bad units, bad sf, bad records, retail condos taxdata <- taxdata[taxdata$Sale.Price>10000,] taxdata <- taxdata[taxdata$Sale.Price<=10000000,] taxdata <- taxdata[!(taxdata$Qual=="" | is.null(taxdata$Qual)),] taxdata <- taxdata[!(taxdata$Sq..Ft.=="" | is.null(taxdata$Sq..Ft.) | taxdata$Sq..Ft.<=400 | taxdata$Sq..Ft.>=3000),] taxdata <- taxdata[!(is.na(taxdata$Map.Page)),] taxdata <- taxdata[!(taxdata$Building.Class=="C"),] taxdata <- taxdata[!(substr(taxdata$Qual,4,4) %in% c("R","L","U")),]
  • 6. THE HACK ON JERSEY CITY CONDO PRICES Tax Assessment Data has Hidden Treasures • From “Qual” we can parse out floor number and unit number taxdata$Floor <- ifelse(taxdata$AddressClean=="389WASHINGTONST"|taxdata$AddressClean=="174WASHINGTONST“ substr(taxdata$Qual,3,4),substr(taxdata$Qual,2,3)) Floor <- ifelse(Floor=="PH",BuildingNumberOfStories,Floor) taxdata$Unit <- ifelse(taxdata$AddressClean=="389WASHINGTONST"|taxdata$AddressClean=="174WASHINGTONST“ substr(taxdata$Qual,5,5),substr(taxdata$Qual,4,5))
  • 7. THE HACK ON JERSEY CITY CONDO PRICES Third Party Data Sources To Join • Condo building attributes: • http://livingonthehudson.com • http://www.jcboe.org • http://www.zillow.com • http://www.streeteasy.com • http://buyersadvisors.com • Location primness: • http://walkscore.com • Building Geocode / Transit location: • http://maps.google.com/maps/api/geocode/ • Census tract level demographics: • http://geomap.ffiec.gov
  • 8. THE HACK ON JERSEY CITY CONDO PRICES Map Out All the Buildings # step 1.4.3 map the buildings out bldmaap <- ggmap(get_googlemap( center='Grove Street, Jersey City, NJ', zoom=14, maptype='roadmap'),extent='device') + geom_point(data=bldgeoc, aes(x=lon, y=lat),colour='darkblue', alpha=0.7, na.rm=TRUE, size=5) bldmaap ggsave(filename="map.png",plot = last_plot(),width=3,height=3)
  • 9. Building Attributes
    For each condo building:
    • Address / Lat / Lon
    • Number of Units
    • Number of Stories
    • Year Built
    • Walk Score
    • Census Tract Median Household Income
    • Distance to Water
    • Distance to PATH (subway) Station
  • 10. Building Data Table
  • 11. Building Scoring System – PLSR is superior
    # step 1.6 come up with a building primeness score using PCA/PLSR
    library(pls)
    bld.pcr <- pcr(BuildingPPSF ~ OrderYearBuilt+OrderWalkScore+OrderMedianHouseholdIncome+OrderDPTH+OrderDWATER,
                   1, data = blddata, validation = "CV")
    bld.pls <- plsr(BuildingPPSF ~ OrderYearBuilt+OrderWalkScore+OrderMedianHouseholdIncome+OrderDPTH+OrderDWATER,
                    1, data = blddata, validation = "CV")
    # predict() on a pls fit returns an array, so flatten it to a plain numeric column
    blddata$BuildingScore <- as.numeric(predict(bld.pls, newdata = blddata))

    PCA
    TRAINING: % variance explained
                  1 comps
    X               35.29
    BuildingPPSF    22.10

    PLSR
    TRAINING: % variance explained
                  1 comps
    X               29.58
    BuildingPPSF    62.48
  • 12. Building Scoring System – PLSR is superior
  • 13. Some Cross Checking on Buildings
    BuildingName             calc.unit.count   stated.unit.count
    700 Grove                            226                 237
    77 Hudson                            407                 420
    Clermont Cove                         97                  NA
    Crystal Point                        257                 269
    Fulton's Landing                     106                 105
    Gulls Cove                           301                 432
    Liberty Terrace                      116                 118
    Mandalay on the Hudson               250                 269
    Montgomery Greene                    102                 113
    Pier House                            99                 180
    Portofino                            264                  NA
    Shore Club North                     211                 220
    Shore Club South                     214                 220
    Sugar House                           48                  65
    The A Condominiums                   238                 250
    The James Monroe                     364                  NA
    Trump Plaza                          391                 445
    Waldo Lofts                           80                  82
    Zephyr Lofts                          96                 102

    Among the 16 buildings with known unit counts:
    • 3,527 total units
    • 3,142 units covered
    • 89% coverage
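The coverage figure can be reproduced directly from the two columns of the cross-check table. The sketch below simply re-types the counts from the slide (variable names are illustrative) and restricts the sums to buildings whose stated unit count is known.

```r
# Unit counts copied from the cross-check table, in table order
# (NA = stated unit count unknown)
calc   <- c(226, 407, 97, 257, 106, 301, 116, 250, 102, 99,
            264, 211, 214, 48, 238, 364, 391, 80, 96)
stated <- c(237, 420, NA, 269, 105, 432, 118, 269, 113, 180,
            NA, 220, 220, 65, 250, NA, 445, 82, 102)

known        <- !is.na(stated)        # buildings with a stated count
n_known      <- sum(known)
total_stated <- sum(stated[known])
total_calc   <- sum(calc[known])
coverage     <- total_calc / total_stated

cat(n_known, "buildings:", total_calc, "of", total_stated,
    "units covered =", round(100 * coverage), "%\n")
```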
  • 14. Condo Unit Attributes
    For each condo unit:
    • Square Footage
    • Sale Price
    • Sale Date
    • Floor
    • Unit Number
    • Building Score
  • 15. Model the date dimension – price index
    # step 2.3 checking condo price per sf over time (price index)
    library(plyr)  # ddply
    library(zoo)   # rollmean
    aggu <- ddply(.data=findata[findata$Sale.YrQtr>="1999 Q1" & findata$Sale.YrQtr<="2014 Q4" &
                                !is.na(findata$Sale.YrQtr),],
                  .variables='Sale.YrQtr', summarize,
                  calc.avg.ppsf=mean(SalePrice/SqFt, na.rm=TRUE))
    # rolling 2-, 4- and 8-quarter averages, padded with leading NAs to keep the row count
    aggu$calc.avg.ppsf.r2q <- append(rollmean(aggu$calc.avg.ppsf, 2), rep(NA,1), after=0)
    aggu$calc.avg.ppsf.r4q <- append(rollmean(aggu$calc.avg.ppsf, 4), rep(NA,3), after=0)
    aggu$calc.avg.ppsf.r8q <- append(rollmean(aggu$calc.avg.ppsf, 8), rep(NA,7), after=0)
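Prepending k-1 NAs to the output of `rollmean()` amounts to a trailing (right-aligned) moving average: each quarter's value averages that quarter and the k-1 quarters before it. A base-R sketch of the same idea using `stats::filter`, on made-up numbers; the helper name `trailing_mean` and the sample series are assumptions for illustration only.

```r
# Hypothetical helper: trailing moving average over the last k observations
trailing_mean <- function(x, k) {
  # sides = 1 uses only current and past values; the first k-1 results are NA
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# Made-up average price-per-sf series, one value per quarter
ppsf <- c(200, 220, 240, 260, 300)

r2q <- trailing_mean(ppsf, 2)
r4q <- trailing_mean(ppsf, 4)
print(r2q)
print(r4q)
```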
  • 16. Last Missing Variable: The View from Units
    Manually entered, using public-domain floor plan data and listing data, and consulting broker friends.
    Three categories:
    • 2 – Great View
    • 1 – Some View
    • 0 – Nothing Special
    In the future, can look for text descriptions in listings:
    • "Manhattan View"
    • "Bay View"
    • "Corner"
    • etc.
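The keyword idea could be prototyped with `grepl()`. This is a rough sketch, not the method used in the project: the listing texts are made up, and the mapping of keywords to the 2 / 1 / 0 categories is an assumption chosen only for illustration.

```r
# Made-up listing descriptions
listings <- c("Corner unit with stunning Manhattan view",
              "Bright 1BR, bay view from bedroom",
              "Renovated kitchen, near PATH")

# Assumed keyword-to-category mapping (hypothetical)
great <- grepl("manhattan view", listings, ignore.case = TRUE)
some  <- grepl("bay view|corner", listings, ignore.case = TRUE)

# 2 = Great View, 1 = Some View, 0 = Nothing Special
view <- ifelse(great, 2, ifelse(some, 1, 0))
print(view)
```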
  • 17. Box Plot: Does View Matter?
  • 18. Model the Price!
    • Simple linear regression – one variable at a time
    • Multi-linear regression
    • Model Tree (Weka)
    • Generalized Boosted Regression Models (gbm)
    • Random Forest
    • Cross-validate all the models
  • 19. Simple linear regression
    # step 3.1 bi-variate linear model
    findata <- read.csv("findata.csv")
    modelLMSqFt <- lm(PPSF~SqFt, data=findata)
    summary(modelLMSqFt) #adjR2=0.0010, p<.05 OKAY NOT SIGNIFICANT
    modelLMFloor <- lm(PPSF~Floor, data=findata)
    summary(modelLMFloor) #adjR2=0.1709, p<.05 GOOD
    modelLMBuildingScore <- lm(PPSF~BuildingScore, data=findata)
    summary(modelLMBuildingScore) #adjR2=0.3584, p<.05 GOOD
    modelLMView <- lm(PPSF~View, data=findata)
    summary(modelLMView) #adjR2=0.0959, p<.05 GOOD
    modelLMIndex <- lm(PPSF~Index, data=findata)
    summary(modelLMIndex) #adjR2=0.1327, p<.05 GOOD
  • 20. Multi-linear regression
    # step 3.2 multi-variate linear model
    modelLM <- lm(PPSF~SqFt+Floor+View+Index+BuildingScore, data=findata)
    summary(modelLM) #adjR2=0.4602
    PPSFHatLM <- predict(modelLM, findata)
    # RMSE() is from the caret package
    RMSE(PPSFHatLM, findata$PPSF, na.rm=TRUE) #99.17106
  • 21. Model Tree
    # step 3.3 model tree
    library(RWeka)  # M5P
    modelMT <- M5P(PPSF~SqFt+Floor+View+Index+BuildingScore, data=findata)
    summary(modelMT)
    findata$PPSFHatMT <- predict(modelMT, findata)
    RMSE(findata$PPSFHatMT, findata$PPSF, na.rm=TRUE) #86.63803
  • 22. gbm
    # step 3.4 gbm
    library(gbm)
    findata <- read.csv("findata.csv")
    modelGBM <- gbm(PPSF~SqFt+Floor+View+Index+BuildingScore,
                    data=findata, distribution="gaussian", n.trees=10000)
    summary(modelGBM)
    findata$PPSFHatGBM <- predict(modelGBM, newdata=findata, n.trees=10000)
    RMSE(findata$PPSFHatGBM, findata$PPSF) #90.60208
  • 23. Random Forest
    # step 3.5 random forest
    library(randomForest)
    findata <- read.csv("findata.csv")
    findata <- findata[!is.na(findata$Index),]  # drop rows with no price index value
    modelRF <- randomForest(PPSF~SqFt+Floor+View+Index+BuildingScore, data=findata)
    summary(modelRF)
    findata$PPSFHatRF <- predict(modelRF, newdata=findata)
    RMSE(findata$PPSFHatRF, findata$PPSF) #64.49773
  • 24. Cross Validation of All Models
    # step 4.1 data partition
    library(caret)  # createDataPartition, RMSE
    in_train <- createDataPartition(findata$PPSF, p=0.75, list=FALSE)
    findata_train <- findata[in_train,]
    findata_test <- findata[-in_train,]
    # one round of 10-fold cross validation; k is only the repetition index passed by sapply()
    # <MODEL> is a placeholder for the fitting function being evaluated
    rmse_cv <- function(k, train){
      m <- nrow(train)
      num <- sample(1:10, m, replace=TRUE)  # random fold label for each row
      rmse <- numeric(10)
      for (i in 1:10) {
        data.t <- train[num!=i, ]  # training folds
        data.v <- train[num==i, ]  # held-out fold
        model <- <MODEL>(PPSF~SqFt+Floor+View+Index+BuildingScore, data=data.t)
        pred <- predict(model, newdata=data.v)
        rmse[i] <- RMSE(pred, data.v$PPSF)
      }
      return(mean(rmse))
    }
    # repeat the 10-fold procedure 100 times
    rmse <- sapply(1:100, rmse_cv, findata_train)
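A self-contained version of the repeated 10-fold procedure, runnable without the project data, might look like the sketch below. It is not the original code: it fits `lm` in place of the model placeholder, uses a synthetic data frame `toy` with made-up coefficients, computes RMSE by hand instead of calling caret, and assigns balanced folds rather than sampling fold labels with replacement.

```r
set.seed(1)

# Synthetic stand-in for the training data (all values are made up)
n   <- 200
toy <- data.frame(SqFt  = runif(n, 500, 2000),
                  Floor = sample(1:30, n, replace = TRUE))
toy$PPSF <- 400 + 5 * toy$Floor - 0.02 * toy$SqFt + rnorm(n, sd = 50)

# One round of k-fold cross validation; rep_id is only the repetition index
rmse_cv <- function(rep_id, train, k = 10) {
  # balanced fold labels in random order
  fold <- sample(rep_len(seq_len(k), nrow(train)))
  fold_rmse <- numeric(k)
  for (i in seq_len(k)) {
    fit  <- lm(PPSF ~ SqFt + Floor, data = train[fold != i, ])
    pred <- predict(fit, newdata = train[fold == i, ])
    fold_rmse[i] <- sqrt(mean((pred - train$PPSF[fold == i])^2))
  }
  mean(fold_rmse)
}

# Repeat the procedure 20 times and average
cv_rmse <- sapply(1:20, rmse_cv, train = toy)
print(mean(cv_rmse))
```

Averaging over repetitions smooths out the dependence of a single k-fold estimate on how the rows happened to be split.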
  • 25. Cross Validation Result
    RMSE                     Total Universe   Cross Validation
    Multi-linear                      99.17              99.55
    Model Tree (Weka M5P)             86.63              78.46
    GBM                               90.60              91.67
    RandomForest                      64.49              82.19
  • 26. Wish List Items…
    • More rigorous regression diagnostics
    • Tuning models better
    • Model blending
    • Compare with Zestimate