This document summarizes an analysis of crime data from Portland, OR using machine learning. Key steps included engineering spatial and temporal features from open data sources, reducing dimensionality with PCA, and combining data sources via geohashing. A random forest model achieved an R² of 0.93 on crime prediction, with poverty levels, commercial activity, and transportation factors as the most important predictors. The final model predicted crime rates for a future 6-week period with high accuracy.
3. Traditional crime prediction
• Small data sets
• Temporal or spatial – not both
• Small number of features
Data-driven crime prediction
• Lack of targeted analysis techniques
• Temporal or spatial – not both
Our Goals
• Open source data
• Engineer spatial and temporal features
• Use targeted statistics
• Model and predict crime
• Understand those predictions
Crime: a technical challenge
4. Data: Portland OR
• Target: Calls for Service (location and time)
• Temporal: Features that distinguish days
• Spatial: Features that distinguish geography
Processing
• Data in Time: Time series
• Data in Space: Kriging
• Dimensionality reduction: PCA
• Combining data: Geohashing
Modeling
• Regression, ensemble
• Model comparison
• Feature importance
Crime: Portland
5. The Tool: Dataiku Data Science Studio
• End-to-end platform for predictive models
• Collaborative
• Connects the best of big data and data science
• Polyglot (SQL, Python, R, …)
• Production ready
• Featuring Spark + Esri ArcGIS Online (maps) + custom data plugins (census!)
7. • Actionable 911 calls – proxy for crimes
• Street crimes, burglary, motor vehicle theft
• Location (latitude, longitude), date
• ~1 million unique calls (indexed by location and time)
• March 2012 – March 2017
• Available from the National Institute of Justice
• geopoint x time x call features
The Data: calls for service
8. • OpenStreetMap.org, 2016
• 11K points of interest
• Geometry: points
• Transport, entertainment, restaurants, public services…
• geopoints x business features
The Data: points of interest
9. • 35,000 check-ins
• Name of business
• Category of business (7)
• Latitude and longitude, distance from city center
• # check-ins, # unique users, tip amount
• geopoint x check-in-features
The Data: foursquare check-ins
10. • 60 precincts
• Spatial tiling – multipolygons
• geotile x precinct label
The Data: police precincts
11. • 20,000+ features
• 2013, 2014, 2015
• 600-3000 people per block group
• Spatial tiling – multipolygons
• geo-tile x year x census features
The Data: US Census
12. • Daily samples, 2012-2016
• 14 weather stations around Portland, sparsely sampled
• Temperature, precipitation, wind speed, presence of snow/rain/thunder/sleet…
• NOAA weather API
• Time x weather features
The Data: weather
13. • Major holidays
• Political events
• time x label
The Data: holidays and events
14. [Diagram: Past Crime Events, Weather Data, Precincts, Pts of Interest, Census Data, Foursquare]
How do we make this useful?
16. The process: time series
Past Crime + Events
Regularities over time to predict the future?
• Facebook’s Prophet (built on Stan, a probabilistic programming language)
• Implemented in Python (scikit-learn-style API)
• Additive regression model
• Piecewise linear or logistic growth-curve trend
• Yearly seasonal component modeled with a Fourier series
• Weekly seasonal component using dummy variables
• Dummy-coded holidays and events (impulse regressors)
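A minimal sketch of this step for a single weekly crime-count series; the DataFrame names `crime_counts` (columns "ds", "y") and `holidays` (columns "holiday", "ds"), and fitting one series per geographic unit, are assumptions.

```python
# Hedged sketch of the Prophet fit described above (hypothetical data names).
from fbprophet import Prophet  # package was "fbprophet" at the time; now "prophet"

m = Prophet(
    growth="linear",            # piecewise linear trend
    yearly_seasonality=True,    # yearly component (Fourier series)
    weekly_seasonality=True,    # weekly component
    holidays=holidays,          # dummy-coded holidays and events
)
m.fit(crime_counts)

# Extend six weeks into the future and predict
future = m.make_future_dataframe(periods=6, freq="W")
forecast = m.predict(future)    # includes "ds", "yhat", and uncertainty bounds
```

The six-week horizon matches the future prediction window used for the final modeling table.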
19. • gstat package: spatial and spatial-temporal geostatistical modeling
• sp package: classes and methods for spatial data
• Implemented in R
The process: spatial kriging
Can we infer the values of sparsely sampled spatial data?
• Gaussian process regression
• Interpolation and extrapolation
• Modeled using a Gaussian process with empirically estimated covariance
• Assumes the correlation between two random variables depends on the spatial distance between them (independent of specific location)
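The deck did this in R with gstat/sp; the sketch below shows an analogous Gaussian-process interpolation in Python. The names `station_coords` (14 × 2 array of lat/lon), `station_values` (observed weather values), and `grid_coords` (locations to interpolate onto) are assumptions.

```python
# Hedged sketch of kriging-style interpolation via Gaussian process regression.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Covariance depends only on the distance between points (stationarity),
# plus a white-noise term for measurement error.
kernel = 1.0 * RBF(length_scale=0.1) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(station_coords, station_values)

# Interpolated values and uncertainty at unsampled locations
grid_values, grid_std = gp.predict(grid_coords, return_std=True)
```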
21. The process: dimensionality reduction
How do we figure out which features are most useful, and not overly correlated?
• Dataiku US Census plugin
• PCA in Python
• Data:
• Look up census block by lat-long
• Group number of crimes by year and census block
• Feature selection: regress and rank
• Correlate each census feature against the target (variance across geography)
• Rank the features by their significance
• Top 5% of regressors for 2013, 2014, 2015
• All features that were significant predictors in all 3 years
• Useful features – but maybe still correlated!
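A rough sketch of the regress-and-rank step for one year; the DataFrame `blocks` (one row per census block group, with a "crimes" column) and the list `census_cols` are hypothetical names.

```python
# Hedged sketch: rank census features by the significance of a univariate
# regression against crime counts, one year at a time.
import pandas as pd
from scipy import stats

rows = []
for col in census_cols:
    result = stats.linregress(blocks[col], blocks["crimes"])
    rows.append({"feature": col, "r": result.rvalue, "p_value": result.pvalue})

ranked = pd.DataFrame(rows).sort_values("p_value")

# Keep roughly the top 5% of regressors for this year; repeat for 2013-2015
# and intersect the per-year sets to get features significant in all three.
top = ranked.head(int(0.05 * len(ranked)))
```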
22. The process: dimensionality reduction
How do we figure out which features are most useful, and not overly correlated?
• Dataiku US Census plugin
• PCA in Python
• Feature engineering: Principal Components Analysis
Re-describe the data using fewer variables
• Find the dimension that explains the most variance in your data
• Find the next most explanatory dimension that is orthogonal to the first
• Stop when adding more dimensions doesn’t really help
• 5 top PCs
• Top 10 regressors
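A minimal sketch of the PCA step, assuming `selected` is a DataFrame holding the census features that survived the regress-and-rank filter (hypothetical name).

```python
# Hedged sketch of the PCA step described above.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(selected)  # PCA is scale-sensitive
pca = PCA(n_components=5)                     # keep the top 5 components
components = pca.fit_transform(X)

# Variance explained by each component, and the loadings used to interpret
# them (e.g. PCA 1 loads on recent moves, limited English, no car)
print(pca.explained_variance_ratio_)
loadings = pca.components_
```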
23. [Diagram: each source in its own frame – Past Crime Events + Events → time series → geo x modeled crime; Weather Data → kriging → time x weather; Precincts → geo x police; Pts of Interest → geo x point of interest; Census Data → dimensionality reduction → geo x census; Foursquare → geo x use]
How can we put this into a common frame?
26. Geohashing – combining your data
• Geohashing plugin (Python)
• 17,000 hashes (~city block)
• PostGIS (PostgreSQL)
• ST_Within – points within grid cells
• ST_Accum – aggregate geometries within a grid cell
• ST_Union – new shape containing all points
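For reference, a minimal sketch of standard geohash encoding itself (the deck used a Dataiku plugin for this); the 7-character precision chosen here as roughly city-block-sized cells is an assumption.

```python
# Hedged sketch: base-32 geohash encoding with longitude/latitude bits interleaved.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=7):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, use_lon = [], True
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        use_lon = not use_lon
    # Map each group of 5 bits to one base-32 character
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )

# Example: a point in downtown Portland at 7-character precision
print(geohash_encode(45.52, -122.68, precision=7))
```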
27. [Diagram: the per-source frames from slide 23 (time series, kriging, dimensionality reduction) merged by geohash and indexed by time into a single table]
28. Data!
• 4.6 million records
• Index: date (weekly) + geo grid (17K cells, ~one block)
• 61 features
• Modeled crime estimates (+6 weeks into the future)
• Weather features
• Police district features
• Points of interest
• Public use (check-ins)
• Census features
29. Modeling
Split the data
• Training set: random sample of ~1,000,000 records before Aug 01, 2015
• Testing set: all records after Aug 01, 2015 (~200,000)
Training
• ~45 minutes
• 3-fold cross validation
Models
• Random Forest
• XGBoost
• L1- and L2-regularized regression (Lasso, Ridge)
Model comparison
• +/- location features
• +/- time features
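A minimal sketch of the temporal split and 3-fold cross-validated comparison; `data` (the merged table with a "date" column), `feature_cols`, and the target column "crime_count" are hypothetical names, and the XGBoost candidate is omitted here.

```python
# Hedged sketch of the split and model comparison described above.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

cutoff = pd.Timestamp("2015-08-01")
train = data[data["date"] < cutoff].sample(n=1_000_000, random_state=0)
test = data[data["date"] >= cutoff]          # ~200,000 held-out records

candidates = {
    "random forest": RandomForestRegressor(n_estimators=100, n_jobs=-1),
    "L1 (Lasso)": Lasso(alpha=1.0),
    "L2 (Ridge)": Ridge(alpha=1.0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, train[feature_cols], train["crime_count"],
                             cv=3, scoring="r2")  # 3-fold cross validation
    print(name, scores.mean())
```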
31. Results: Model Performance
Random Forest! R² = 0.93 (configuration sketched below)
• Split quality criterion: MSE
• Number of trees: 100 (with bootstrap)
• Max tree depth: 31
• Min samples per leaf: 10
• Min samples to split: 30
Better than
• Model with raw spatial and temporal features
• Model without spatial features
• Model without temporal features
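A sketch of the winning configuration using the hyperparameters reported on this slide; `train`, `test`, and `feature_cols` carry over from the earlier split sketch (hypothetical names), and "squared_error" is how current scikit-learn spells the MSE criterion.

```python
# Hedged sketch of the reported random forest configuration.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,
    criterion="squared_error",
    max_depth=31,
    min_samples_leaf=10,
    min_samples_split=30,
    bootstrap=True,
    n_jobs=-1,
)
rf.fit(train[feature_cols], train["crime_count"])
print("test R^2:", rf.score(test[feature_cols], test["crime_count"]))
```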
32. Results: Feature Importance
[Bar chart: relative importance of the top 10 predictors, 0%–16%]
• # Shops
• PCA 1
• # Foursquare Checkins
• Poverty Status of People living alone
• # Foursquare Users (Total)
• # Travel/Transport
• Houses with no vehicle available
• # Food Places
• # Service Stores
• PCA 3
PCA 1
• Lived in a different house a year ago
• Household speaks limited English
• Main means of transportation is not a car
• Adults don’t speak English well or at all
• Children speak both English and Spanish
PCA 3
• Household speaks a language from Asia or the Pacific Islands
• % population born outside the US
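A short sketch of how the importances behind the chart above can be pulled from the fitted forest; `rf` and `feature_cols` carry over from the earlier sketches (hypothetical names).

```python
# Hedged sketch: rank features by random forest importance.
import pandas as pd

importances = (
    pd.Series(rf.feature_importances_, index=feature_cols)
      .sort_values(ascending=False)
)
print(importances.head(10))  # the top 10 predictors shown on this slide
```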
33. Results: Correlation
[Correlation matrix of the top predictors – PCA 1, Poverty Status of People, PCA 3, # Service Stores, # Shops, # Foursquare Checkins, # Food Places, # Foursquare Users, No vehicles, # Travel/Transport – against one another and against crime counts]
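A minimal sketch of the correlation matrix on this slide, assuming the merged table `data` has columns for the listed predictors plus the crime count; all column names here are hypothetical.

```python
# Hedged sketch: pairwise Pearson correlations of the top predictors and crime.
top_cols = ["pca_1", "poverty_alone", "pca_3", "n_service_stores", "n_shops",
            "n_checkins", "n_food_places", "n_fs_users", "no_vehicle",
            "n_transport", "crime_count"]
corr = data[top_cols].corr()
print(corr["crime_count"].sort_values(ascending=False))
```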
35. [Diagram recap: past crime + events via time series, weather via kriging, precincts, points of interest, census via dimensionality reduction, and Foursquare, merged by geohash and indexed by time into the final modeling table]
36. Data
• Open source, high dimensionality, varied across space & time
Analyses
• Time series, kriging, dimensionality reduction
• Geohashing
• Machine learning
Conclusions
• Keep a diverse set of tools in your data science toolbox!
• Using targeted analysis methods will improve your model
• Poverty, frequent moves, lack of vehicles, lack of English, and lack of commercial infrastructure predict crime rates
Crime: a social and technical challenge
Editor's notes
Domestic crime is a ubiquitous problem – it costs the UK 124 billion+ per year in prevention, intervention, and harm. Despite that, we know surprisingly little.
Being able to predict, and thus prevent, crime is incredibly important. Crime causes billions in damages. However, getting a handle on when and where a crime will occur is a serious challenge. Depending on which indicators you look at, crime rates may be increasing or decreasing, and everything from changes in the political environment to the humidity content of the air has been cited as a cause of these fluctuations.
Crime presents not only a social and political challenge, but also a technical one. Traditional analyses suffer from a few issues: size and technique. Big data, however, doesn’t have all of the elements that we need.
Portland, Oregon, because the NIJ makes the data available.
Semi-variance; variogram fit with an exponential model.