This document summarizes an analysis of crime data from Portland, OR using machine learning. Key steps included engineering spatial and temporal features from open data sources, reducing dimensionality with PCA, and combining data sources via geohashing. A random forest model achieved an R² of 0.93 on crime prediction, with poverty levels, commercial activity, and transportation factors as the most important predictors. The final model predicted crime rates for a future 6-week period with high accuracy.
3. Traditional crime prediction
• Small data sets
• Temporal or spatial – not both
• Small number of features
Data-driven crime prediction
• Lack of targeted analysis techniques
• Temporal or spatial – not both
Our Goals
• Open source data
• Engineer spatial and temporal features
• Use targeted statistics
• Model and predict crime
• Understand those predictions
Crime: a technical challenge
4. Data: Portland OR
• Target: Calls for Service (location and time)
• Temporal: Features that distinguish days
• Spatial: Features that distinguish geography
Processing
• Data in Time: Time series
• Data in Space: Kriging
• Dimensionality reduction: PCA
• Combining data: Geohashing
Modeling
• Regression, ensemble
• Model comparison
• Feature importance
Crime: Portland
5. The Tool: Dataiku Data Science Studio
• End-to-end platform for predictive models
• Collaborative
• Connects the best of big data and data science
• Polyglot (SQL, Python, R, …)
• Production ready
• Featuring Spark + Esri ArcGIS Online (maps) + custom data plugins (census!)
7. • Actionable 911 calls – proxy for crimes
• Street crimes, burglary, motor vehicle theft
• Location (latitude, longitude), date
• ~1 million unique calls (indexed by location and time)
• March 2012 – March 2017
• Available from the National Institute of Justice
• geopoint x time x call features
The Data: calls for service
8. • OpenStreetMap.org, 2016
• 11K points of interest
• Geometry: points
• Transport, entertainment, restaurants, public services…
• geopoints x business features
The Data: points of interest
9. • 35,000 check-ins
• Name of business
• Category of business (7)
• Latitude and longitude, distance from city center
• # check-ins, # unique users, tip amount
• geopoint x check-in-features
The Data: foursquare check-ins
10. • 60 precincts
• Spatial tiling – multipolygons
• geotile x precinct label
The Data: police precincts
11. • 20,000+ features
• 2013, 2014, 2015
• 600-3000 people per block group
• Spatial tiling – multipolygons
• geo-tile x year x census features
The Data: US Census
12. • Daily samples, 2012-2016
• 14 weather stations around Portland, sparsely sampled
• Temperature, precipitation, wind speed, presence of snow/rain/thunder/sleet…
• NOAA weather API
• Time x weather features
The Data: weather
13. • Major holidays
• Political events
• time x label
The Data: holidays and events
14. [Diagram: Past Crime Events, Weather Data, Precincts, Pts of Interest, Census Data, Foursquare]
How do we make this useful?
16. The process: time series
Past Crime + Events
Regularities over time to predict the future?
• Facebook’s Prophet (built on Stan, a probabilistic programming language)
• Implemented in Python (scikit-learn-style API)
• Additive regression model
• Piecewise linear or logistic growth-curve trend
• Yearly seasonal component modeled with a Fourier series
• Weekly seasonal component using dummy variables
• Dummy-coded holidays and events (impulse regressors)
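A minimal sketch of this step for a single weekly crime-count series; the DataFrame names `crime_counts` (columns "ds", "y") and `holidays` (columns "holiday", "ds"), and fitting one series per geographic unit, are assumptions.

```python
# Hedged sketch of the Prophet fit described above (hypothetical data names).
from fbprophet import Prophet  # package was "fbprophet" at the time; now "prophet"

m = Prophet(
    growth="linear",            # piecewise linear trend
    yearly_seasonality=True,    # yearly component (Fourier series)
    weekly_seasonality=True,    # weekly component
    holidays=holidays,          # dummy-coded holidays and events
)
m.fit(crime_counts)

# Extend six weeks into the future and predict
future = m.make_future_dataframe(periods=6, freq="W")
forecast = m.predict(future)    # includes "ds", "yhat", and uncertainty bounds
```

The six-week horizon matches the future prediction window used for the final modeling table.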
19. • gstat package: spatial and spatial-temporal geostatistical modeling
• sp package: classes and methods for spatial data
• Implemented in R
The process: spatial kriging
Can we infer the values of sparsely sampled spatial data?
• Gaussian process regression
• Interpolation and extrapolation
• Modeled using a Gaussian process with empirically estimated covariance
• Assumes the correlation between two random variables depends on the spatial distance between them (independent of specific location)
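The deck did this in R with gstat/sp; the sketch below shows an analogous Gaussian-process interpolation in Python. The names `station_coords` (14 × 2 array of lat/lon), `station_values` (observed weather values), and `grid_coords` (locations to interpolate onto) are assumptions.

```python
# Hedged sketch of kriging-style interpolation via Gaussian process regression.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Covariance depends only on the distance between points (stationarity),
# plus a white-noise term for measurement error.
kernel = 1.0 * RBF(length_scale=0.1) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(station_coords, station_values)

# Interpolated values and uncertainty at unsampled locations
grid_values, grid_std = gp.predict(grid_coords, return_std=True)
```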
21. The process: dimensionality reduction
How do we figure out which features are most useful, and not overly correlated?
• Dataiku US Census plugin
• PCA in Python
• Data:
• Look up census block by lat-long
• Group number of crimes by year and census block
• Feature selection: regress and rank
• Correlate each census feature against the target (variance across geography)
• Rank the features by their significance
• Top 5% of regressors for 2013, 2014, 2015
• All features that were significant predictors in all 3 years
• Useful features – but maybe still correlated!
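A rough sketch of the regress-and-rank step for one year; the DataFrame `blocks` (one row per census block group, with a "crimes" column) and the list `census_cols` are hypothetical names.

```python
# Hedged sketch: rank census features by the significance of a univariate
# regression against crime counts, one year at a time.
import pandas as pd
from scipy import stats

rows = []
for col in census_cols:
    result = stats.linregress(blocks[col], blocks["crimes"])
    rows.append({"feature": col, "r": result.rvalue, "p_value": result.pvalue})

ranked = pd.DataFrame(rows).sort_values("p_value")

# Keep roughly the top 5% of regressors for this year; repeat for 2013-2015
# and intersect the per-year sets to get features significant in all three.
top = ranked.head(int(0.05 * len(ranked)))
```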
22. The process: dimensionality reduction
How do we figure out which features are most useful, and not overly correlated?
• Dataiku US Census plugin
• PCA in Python
• Feature engineering: Principal Components Analysis
Re-describe the data using fewer variables
• Find the dimension that explains the most variance in your data
• Find the next most explanatory dimension that is orthogonal to the first
• Stop when adding more dimensions doesn’t really help
• 5 top PCs
• Top 10 regressors
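A minimal sketch of the PCA step, assuming `selected` is a DataFrame holding the census features that survived the regress-and-rank filter (hypothetical name).

```python
# Hedged sketch of the PCA step described above.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(selected)  # PCA is scale-sensitive
pca = PCA(n_components=5)                     # keep the top 5 components
components = pca.fit_transform(X)

# Variance explained by each component, and the loadings used to interpret
# them (e.g. PCA 1 loads on recent moves, limited English, no car)
print(pca.explained_variance_ratio_)
loadings = pca.components_
```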
23. [Diagram: each source in its own frame – Past Crime Events + Events → time series → geo x modeled crime; Weather Data → kriging → time x weather; Precincts → geo x police; Pts of Interest → geo x point of interest; Census Data → dimensionality reduction → geo x census; Foursquare → geo x use]
How can we put this into a common frame?
26. Geohashing – combining your data
• Geohashing plugin (Python)
• 17,000 hashes (~city block)
• PostGIS (PostgreSQL)
• ST_Within – points within grid cells
• ST_Accum – aggregate geometries within a grid cell
• ST_Union – new shape containing all points
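For reference, a minimal sketch of standard geohash encoding itself (the deck used a Dataiku plugin for this); the 7-character precision chosen here as roughly city-block-sized cells is an assumption.

```python
# Hedged sketch: base-32 geohash encoding with longitude/latitude bits interleaved.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=7):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, use_lon = [], True
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        use_lon = not use_lon
    # Map each group of 5 bits to one base-32 character
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )

# Example: a point in downtown Portland at 7-character precision
print(geohash_encode(45.52, -122.68, precision=7))
```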
27. [Diagram: the per-source frames from slide 23 (time series, kriging, dimensionality reduction) merged by geohash and indexed by time into a single table]
28. Data!
• 4.6 million records
• Index: date (weekly) + geo grid (17K cells, ~one block)
• 61 features
• Modeled crime estimates (+6 weeks into the future)
• Weather features
• Police district features
• Points of interest
• Public use (check-ins)
• Census features
29. Modeling
Split the data
• Training set: random sample of ~1,000,000 records before Aug 01, 2015
• Testing set: all records after Aug 01, 2015 (~200,000)
Training
• ~45 minutes
• 3-fold cross validation
Models
• Random Forest
• XGBoost
• L1- and L2-regularized regression (Lasso, Ridge)
Model comparison
• +/- location features
• +/- time features
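A minimal sketch of the temporal split and 3-fold cross-validated comparison; `data` (the merged table with a "date" column), `feature_cols`, and the target column "crime_count" are hypothetical names, and the XGBoost candidate is omitted here.

```python
# Hedged sketch of the split and model comparison described above.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

cutoff = pd.Timestamp("2015-08-01")
train = data[data["date"] < cutoff].sample(n=1_000_000, random_state=0)
test = data[data["date"] >= cutoff]          # ~200,000 held-out records

candidates = {
    "random forest": RandomForestRegressor(n_estimators=100, n_jobs=-1),
    "L1 (Lasso)": Lasso(alpha=1.0),
    "L2 (Ridge)": Ridge(alpha=1.0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, train[feature_cols], train["crime_count"],
                             cv=3, scoring="r2")  # 3-fold cross validation
    print(name, scores.mean())
```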
31. Results: Model Performance
Random Forest! R² = 0.93 (configuration sketched below)
• Split quality criterion: MSE
• Number of trees: 100 (with bootstrap)
• Max tree depth: 31
• Min samples per leaf: 10
• Min samples to split: 30
Better than
• Model with raw spatial and temporal features
• Model without spatial features
• Model without temporal features
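A sketch of the winning configuration using the hyperparameters reported on this slide; `train`, `test`, and `feature_cols` carry over from the earlier split sketch (hypothetical names), and "squared_error" is how current scikit-learn spells the MSE criterion.

```python
# Hedged sketch of the reported random forest configuration.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,
    criterion="squared_error",
    max_depth=31,
    min_samples_leaf=10,
    min_samples_split=30,
    bootstrap=True,
    n_jobs=-1,
)
rf.fit(train[feature_cols], train["crime_count"])
print("test R^2:", rf.score(test[feature_cols], test["crime_count"]))
```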
32. Results: Feature Importance
[Bar chart: relative importance of the top 10 predictors, 0%–16%]
• # Shops
• PCA 1
• # Foursquare Checkins
• Poverty Status of People living alone
• # Foursquare Users (Total)
• # Travel/Transport
• Houses with no vehicle available
• # Food Places
• # Service Stores
• PCA 3
PCA 1
• Lived in a different house a year ago
• Household speaks limited English
• Main means of transportation is not a car
• Adults don’t speak English well or at all
• Children speak both English and Spanish
PCA 3
• Household speaks a language from Asia or the Pacific Islands
• % population born outside the US
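A short sketch of how the importances behind the chart above can be pulled from the fitted forest; `rf` and `feature_cols` carry over from the earlier sketches (hypothetical names).

```python
# Hedged sketch: rank features by random forest importance.
import pandas as pd

importances = (
    pd.Series(rf.feature_importances_, index=feature_cols)
      .sort_values(ascending=False)
)
print(importances.head(10))  # the top 10 predictors shown on this slide
```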
33. Results: Correlation
[Correlation matrix of the top predictors – PCA 1, Poverty Status of People, PCA 3, # Service Stores, # Shops, # Foursquare Checkins, # Food Places, # Foursquare Users, No vehicles, # Travel/Transport – against one another and against crime counts]
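A minimal sketch of the correlation matrix on this slide, assuming the merged table `data` has columns for the listed predictors plus the crime count; all column names here are hypothetical.

```python
# Hedged sketch: pairwise Pearson correlations of the top predictors and crime.
top_cols = ["pca_1", "poverty_alone", "pca_3", "n_service_stores", "n_shops",
            "n_checkins", "n_food_places", "n_fs_users", "no_vehicle",
            "n_transport", "crime_count"]
corr = data[top_cols].corr()
print(corr["crime_count"].sort_values(ascending=False))
```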
35. [Diagram recap: past crime + events via time series, weather via kriging, precincts, points of interest, census via dimensionality reduction, and Foursquare, merged by geohash and indexed by time into the final modeling table]
36. Data
• Open source, high dimensionality, varied across space & time
Analyses
• Time series, kriging, dimensionality reduction
• Geohashing
• Machine learning
Conclusions
• Keep a diverse set of tools in your data science toolbox!
• Using targeted analysis methods will improve your model
• Poverty, frequent moves, lack of vehicles, lack of English, and lack of commercial infrastructure predict crime rates
Crime: a social and technical challenge
Editor's notes
Domestic crime is a ubiquitous problem – it costs the UK 124 billion+ per year in prevention, intervention, and harm. Despite that, we know surprisingly little.
Being able to predict, and thus prevent, crime is incredibly important. Crime causes billions in damages. However, getting a handle on when and where a crime will occur is a serious challenge. Depending on which indicators you look at, crime rates may be increasing or decreasing, and everything from changes in the political environment to the humidity content of the air has been cited as a cause of these fluctuations.
Crime presents not only a social and political challenge, but also a technical one. Traditional analyses suffer from a few issues: size and technique. Big data, however, doesn’t have all of the elements that we need.
Portland, Oregon, because the NIJ makes the data available.
Semi-variance; variogram fit with an exponential model.