Rent, Rain, and Regulations
Using Data to predict Crime
Du Phan, Dataiku
Crime
Traditional crime prediction
• Small data sets
• Temporal or spatial – not both
• Small number of features
Data-driven crime prediction
• Lack of targeted analysis techniques
• Temporal or spatial – not both
Our Goals
• Open source data
• Engineer spatial and temporal features
• Use targeted statistics
• Model and predict crime
• Understand those predictions
Crime: a technical challenge
Data: Portland OR
• Target: Calls for Service (location and time)
• Temporal: Features that distinguish days
• Spatial: Features that distinguish geography
Processing
• Data in Time: Time series
• Data in Space: Kriging
• Dimensionality reduction: PCA
• Combining data: Geohashing
Modeling
• Regression, ensemble
• Model comparison
• Feature importance
Crime: Portland
The Tool: Dataiku Data Science Studio
• End to end platform for predictive models
• Collaborative
• Connect the best of big data and data science
• Polyglot (SQL, Python, R, …)
• Production ready
• Featuring Spark + Esri ArcGIS Online (maps) + custom data plugins
(census!)
The Data
• Actionable 911 calls – proxy for crimes
• Street crimes, burglary, motor vehicle theft
• Location (latitude, longitude), date
• ~1 million unique calls
(indexed by location and time)
• March 2012 – March 2017
• Available from the National Institute of
Justice
• geopoint x time x call features
The Data: calls for service
• OpenStreetMap.org, 2016
• 11K points of interest
• Geom points
• Transport, entertainment,
restaurants, public services…
• geopoints x business features
The Data: points of interest
• 35,000 check-ins
• Name of business
• Category of business (7)
• Latitude and Longitude, Distance from city
center
• # check-ins, # unique users, tip amount
• geopoint x check-in-features
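A "distance from city center" feature can be derived from latitude and longitude alone. The sketch below uses the haversine great-circle formula. The slides do not say which distance measure was used, so this choice is an assumption, as are the city-center coordinates and the name `haversine_km`.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Assumed approximate coordinates for central Portland, OR (illustrative only).
CITY_CENTER = (45.5152, -122.6784)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def distance_from_center(lat, lon):
    """Distance of one check-in location from the assumed city center."""
    return haversine_km(lat, lon, CITY_CENTER[0], CITY_CENTER[1])
```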
The Data: foursquare check-ins
• 60 precincts
• Spatial tiling – multipolygons
• geotile x precinct label
The Data: police precincts
• 20,000+ features
• 2013, 2014, 2015
• 600-3000 people per block group
• Spatial tiling – multipolygons
• geo-tile x year x census features
The Data: US Census
• Daily samples, 2012-2016
• 14 weather stations around Portland,
sparsely sampled
• Temperature, precipitation, wind speed,
presence of snow/rain/thunder/sleet…
• NOAA weather API
• Time x weather features
The Data: weather
• Major holidays
• Political events
• time x label
The Data: holidays and events
[Diagram: data sources: Past Crime, Events, Weather Data, Precincts, Pts of Interest, Census Data, Foursquare]
How do we make this useful?
The Process
The process: time series
Events, Past Crime
Regularities over time to predict the future?
• Facebook’s Prophet (built in Stan, a probabilistic programming
language)
• Implemented in Python (scikit-learn-style API)
• additive regression model
• piecewise linear or logistic growth curve trend.
• yearly seasonal component modeled using Fourier series.
• weekly seasonal component using dummy variables
• dummy coded holidays and events (impulse regressor)
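The components listed above can be pictured as columns of a regression design matrix. The following is a minimal sketch of that idea in plain Python, not Prophet's actual implementation: it builds one row of regressors for a given day. The function name `seasonal_features`, the reference date, the Fourier order, and the single linear trend term (instead of a piecewise trend) are all assumptions made for illustration.

```python
import math
from datetime import date

# Assumed reference date for the trend term (the call data starts in March 2012).
REFERENCE_DAY = date(2012, 3, 1)

def seasonal_features(day, holidays, fourier_order=3):
    """Build one row of regressors for an additive daily model."""
    t = (day - REFERENCE_DAY).days
    # Simplification: a single linear trend instead of a piecewise one.
    row = {"trend": float(t)}
    # Yearly seasonality as Fourier sine/cosine pairs with a 365.25-day period.
    for k in range(1, fourier_order + 1):
        angle = 2.0 * math.pi * k * t / 365.25
        row[f"yearly_sin_{k}"] = math.sin(angle)
        row[f"yearly_cos_{k}"] = math.cos(angle)
    # Weekly seasonality as dummy variables; Monday (0) is the dropped baseline.
    for wd in range(1, 7):
        row[f"weekday_{wd}"] = 1.0 if day.weekday() == wd else 0.0
    # Holidays and events as an impulse regressor: 1 on the day, 0 otherwise.
    row["holiday"] = 1.0 if day in holidays else 0.0
    return row

holidays = {date(2016, 12, 25)}
row = seasonal_features(date(2016, 12, 25), holidays)
```

Fitting a linear model on such rows yields one coefficient per component, which is what makes an additive model easy to inspect.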
The process: time series
The process: time series
• gstat package: spatial and spatial-temporal geostatistical
modeling
• sp package: classes and methods for Spatial Data
• Implemented in R
The process: spatial kriging
Can we infer the values of sparsely sampled spatial data?
• Gaussian process regression
• Interpolation and extrapolation
• Modeled using a Gaussian process with empirically
estimated covariance
• Assumes the correlation between two random variables
depends on the spatial distance between them
(independent of specific location)
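To make the idea concrete, here is a small pure-Python sketch of simple kriging around the sample mean. It is not the gstat workflow from the slides: the Gaussian covariance function and its `sill` and `length_scale` values are fixed by assumption here rather than estimated empirically, and all names (`covariance`, `solve`, `krige`) are hypothetical.

```python
import math

def covariance(p, q, sill=1.0, length_scale=10.0):
    """Covariance that depends only on the distance between two points."""
    d = math.dist(p, q)
    return sill * math.exp(-((d / length_scale) ** 2))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def krige(stations, values, target, nugget=1e-9):
    """Predict the value at `target` from sparsely sampled stations."""
    mean = sum(values) / len(values)
    # Station-to-station covariance matrix; the nugget keeps it well conditioned.
    k = [[covariance(p, q) + (nugget if i == j else 0.0)
          for j, q in enumerate(stations)]
         for i, p in enumerate(stations)]
    # Kriging weights solve K w = k(target).
    weights = solve(k, [covariance(p, target) for p in stations])
    return mean + sum(w * (v - mean) for w, v in zip(weights, values))

# Hypothetical stations on a flat x/y grid with one temperature reading each.
stations = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
temps = [10.0, 12.0, 14.0]
estimate = krige(stations, temps, (5.0, 5.0))
```

At a station the prediction reproduces the observed value, and far from every station it falls back to the sample mean, which is the interpolation and extrapolation behavior described above.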
The process: spatial kriging
variance
The process: dimensionality reduction
How do we figure out which features are most useful, and not
overly correlated?
• Dataiku US Census plugin
• PCA in Python
• Data:
• Look up census block by lat-long
• Group number of crimes by year and census block
• Feature Selection: Regress and Rank
• Correlate each census feature against target
(variance across geography)
• Rank the features by their significance
• Top 5% of regressors for 2013, 2014, 2015
• All features that were significant predictors in all 3 years
• Useful features – but maybe still correlated!
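The regress-and-rank step can be sketched as follows, assuming one table of census features and one crime-count target per year. Ranking by absolute Pearson correlation is used here as a stand-in for ranking by significance (for a fixed sample size the two orderings agree). The function names and the data layout are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def top_features(features, target, fraction):
    """Names of the top `fraction` of features by |correlation| with target."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    keep = max(1, math.ceil(fraction * len(ranked)))
    return set(ranked[:keep])

def stable_features(yearly, fraction=0.05):
    """Features in the top fraction for every year.

    `yearly` maps year -> (features, target), where `features` maps a
    feature name to its values across census block groups.
    """
    selected = [top_features(f, t, fraction) for f, t in yearly.values()]
    return set.intersection(*selected)
```

As the slide notes, features that survive this filter can still be strongly correlated with each other, which motivates the PCA step.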
The process: dimensionality reduction
How do we figure out which features are most useful, and not
overly correlated?
• Dataiku US Census plugin
• PCA in Python
• Feature Engineering: Principal Components Analysis
Re-describe the data using fewer variables
• Find the dimension that explains the most variance in
your data
• Find the next most explanatory dimension that is
orthogonal to the first
• Stop when adding more dimensions doesn’t really help
• 5 top PCs
• Top 10 Regressors
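The "find the dimension that explains the most variance" step can be illustrated with power iteration on the covariance matrix. This is a teaching sketch in plain Python, not the PCA routine used in the talk, and the function name `first_principal_component` is hypothetical. It assumes the data is not constant, otherwise the normalization divides by zero.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, explained variance ratio) of the first PC."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along v, as a share of the total variance.
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                   for a in range(d))
    total = sum(cov[a][a] for a in range(d))
    return v, variance / total

# Toy data lying exactly on the line y = 2x.
rows = [(float(x), 2.0 * x) for x in range(5)]
direction, ratio = first_principal_component(rows)
```

Further components would come from repeating the procedure after removing the variance already explained, and the stopping rule on the slide corresponds to watching the explained-variance ratio level off.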
[Diagram: each data source and its processed form]
• Past Crime, Events: Time Series, Geo x Modeled Crime
• Weather Data: Time by Weather
• Precincts: Geo by Police
• Pts of Interest: Geo by Pt Interest
• Census Data: Geo by Census
• Foursquare: Geo by Use
• Processing steps: Kriging, Dim. Reduction
How can we put this into a common frame?
Combining and Modeling
Geohashing – combining your data
Geohashing – combining your data
• Geohashing plugin (Python)
• 17,000 hashes (~city block)
• PostGIS (PostgreSQL)
• ST_Within: points in grids
• ST_Accum: aggregate within grid
• ST_Union: new shape containing all points
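For readers unfamiliar with geohashing, the sketch below encodes a latitude/longitude pair with the standard geohash algorithm: longitude and latitude bits are interleaved and emitted as base-32 characters. It is a generic illustration, not the Dataiku plugin's code. The default precision of 7 is an assumption chosen because cells at that length are on the order of a city block.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=7):
    """Encode a point as a geohash string of `precision` characters."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, value = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)

# Nearby points share a prefix, so the hash works as a join key across datasets.
cell = geohash_encode(45.52, -122.68)
```

Because every dataset with a latitude and longitude can be mapped to the same cell identifiers, the hash gives the common spatial key needed to merge them.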
[Diagram: sources and their processed forms, merged into one dataset]
• Past Crime, Events: Time Series, Geo x Modeled Crime
• Weather Data: Time by Weather
• Precincts: Geo by Police
• Pts of Interest: Geo by Pt Interest
• Census Data: Geo by Census
• Foursquare: Geo by Use
• Processing steps: Kriging, Dim. Reduction
• Merge by GeoHash, index by time = Data!
• 4.6 million records
• Index: Date (weekly) +
Geo grid (17K, ~one block)
• 61 features
• Model crime estimates (+6 weeks future)
• Weather features
• Police district features
• Points of interest
• Public use (check-ins)
• Census features
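A weekly date plus grid-cell index of this kind can be built by bucketing each event into the week it falls in and the cell it lies in. The sketch below assumes weeks start on Monday and that each event already carries a cell identifier. Both are assumptions, and the names `week_start` and `count_by_week_and_cell` are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

def week_start(day):
    """Monday of the week containing `day` (assumed week convention)."""
    return day - timedelta(days=day.weekday())

def count_by_week_and_cell(events):
    """Count events per (week, cell); `events` yields (date, cell_id) pairs."""
    counts = Counter()
    for day, cell in events:
        counts[(week_start(day), cell)] += 1
    return counts

# Hypothetical events: two fall in the week of 19 Dec 2016, one in the next week.
events = [
    (date(2016, 12, 19), "cell_a"),
    (date(2016, 12, 25), "cell_a"),
    (date(2016, 12, 26), "cell_a"),
]
weekly = count_by_week_and_cell(events)
```

Features that vary only in time (weather) or only in space (census, precincts) can then be joined onto this index by week or by cell respectively.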
Modeling
Split the data
• Training set: random sample of ~1,000,000 records before
Aug 01 2015
• Testing set: all records after Aug 01, 2015 (~200,000)
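The split described above separates train and test by time rather than at random, so the model is always evaluated on weeks it has not seen. A minimal sketch, assuming each record is a dict with a `date` key and that records dated exactly on the cutoff go to the test set (the slides do not specify either detail):

```python
import random
from datetime import date

CUTOFF = date(2015, 8, 1)

def temporal_split(records, train_size, seed=0):
    """Random training sample from before the cutoff; all later records for testing."""
    before = [r for r in records if r["date"] < CUTOFF]
    # Assumption: a record dated exactly on the cutoff counts as test data.
    after = [r for r in records if r["date"] >= CUTOFF]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    train = rng.sample(before, min(train_size, len(before)))
    return train, after
```

Sampling only the training side keeps training time manageable while leaving the held-out period complete.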
Training
• ~45 minutes
• 3-fold cross validation
Models
• Random Forest
• XGBoost
• L1, L2 regression
Model comparison
• +/- location features
• +/- time features
Results
Results: Model Performance
Random Forest! R² = 0.93
• Split quality criterion: MSE
• Number of trees: 100 (with bootstrap)
• Max tree depth: 31
• Min samples per leaf: 10
• Min samples to split: 30
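For reference, these settings could be written down as follows. The slides do not say which random forest implementation was used, so the mapping onto scikit-learn's `RandomForestRegressor` parameter names is an assumption, as is the helper `build_forest`.

```python
# Settings from the slide, keyed by scikit-learn parameter names (assumed mapping).
RF_PARAMS = {
    "criterion": "squared_error",  # split quality criterion: MSE
    "n_estimators": 100,           # number of trees
    "bootstrap": True,             # trees are fit on bootstrap samples
    "max_depth": 31,               # max tree depth
    "min_samples_leaf": 10,        # min samples per leaf
    "min_samples_split": 30,       # min samples to split
}

def build_forest(params=RF_PARAMS):
    """Create an unfitted regressor with the settings above."""
    # Imported lazily so the settings can be inspected without scikit-learn installed.
    from sklearn.ensemble import RandomForestRegressor
    return RandomForestRegressor(**params)
```

The leaf and split minimums act as regularization: with roughly a million training rows, they stop individual trees from memorizing single grid cells.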
Better than
• Model with raw spatial and
temporal features
• Model without spatial features
• Model without temporal features
Results: Feature Importance
[Bar chart: feature importance, axis from 0% to 16%. Features shown:]
• # Shops
• PCA 1
• # Foursquare Checkins
• Poverty Status of People living alone
• # Foursquare Users (Total)
• # Travel/Transport
• Houses with no vehicle available
• # Food Places
• # Service Stores
• PCA 3
PCA1
• Lived in a different house a year ago
• Household speaks limited English
• Main means of transportation is not a car
• Adults don’t speak English well or at all
• Children speak both English and Spanish
PCA3
• Household speaks language from Asia or
Pacific Islands
• % population born outside the US
Results: Correlation
[Correlation matrix over: PCA 1, Poverty Status of People, PCA 3, # Service Stores, # Shops, # Foursquare Checkins, # Food Places, # Foursquare Users, No vehicles, # Travel/Transport, Crimes]
Results: Future Crime (March 15-April 1)
[Diagram: pipeline recap. Sources are processed (time series, kriging, dimensionality reduction), merged by geohash, and indexed by time.]
Data
• Open source, high dimensionality,
varied across space & time
Analyses
• Time series, kriging, dimensionality reduction
• Geohashing
• Machine learning
Conclusions
• Keep a diverse set of tools in your
data science tool box!
• Using targeted analysis methods will
improve the model
• Poverty, frequent moves, lack of vehicles,
lack of English, and lack of commercial
infrastructure predict crime rates
Crime: a social and technical challenge

Rent, Rain, and Regulations: Using Data to predict Crime

  • 1. Rent, Rain, and Regulations: Using Data to predict Crime. Du Phan, Dataiku
  • 3. Traditional crime prediction • Small data sets • Temporal or spatial – not both • Small number of features Data-driven crime prediction • Lack of targeted analysis techniques • Temporal or spatial – not both Our Goals • Open source data • Engineer spatial and temporal features • Use targeted statistics • Model and predict crime • Understand those predictions Crime: a technical challenge
  • 4. Data: Portland OR • Target: Calls for Service (location and time) • Temporal: Features that distinguish days • Spatial: Features that distinguish geography Processing • Data in Time: Time series • Data in Space: Kriging • Dimensionality reduction: PCA • Combining data: Geohashing Modeling • Regression, ensemble • Model comparison • Feature importance Crime: Portland
  • 5. The Tool: Dataiku Data Science Studio • End to end platform for predictive models • Collaborative • Connect the best of big data and data science • Polyglot (SQL, Python, R, …) • Production ready • Featuring Spark + Esri ArcGis Online (maps) + custom data plugins (census!)
  • 7. • Actionable 911 calls – proxy for crimes • Street crimes, burglary, motor vehicle theft • Location (latitude, longitude), date • ~1 million unique calls (indexed by location and time) • March 2012 – March 2017 • Available from the National Institute of Justice • geopoint x time x call features The Data: calls for service
  • 8. • OpenStreetMap.org, 2016 • 11 K points of interests • Geom points • Transport, entertainment, restaurants, public services… • geopoints x business features The Data: points of interest
  • 9. • 35,000 check-ins • Name of business • Category of business (7) • Latitude and Longitude, Distance from city center • # check-ins, # unique users, tip amount • geopoint x check-in-features The Data: foursquare check-ins
  • 10. • 60 precincts • Spatial tiling – multipolygons • geotile x precinct label The Data: police precincts
  • 11. • 20,000+ features • 2013, 2014, 2015 • 600-3000 people per block group • Spatial tiling – multipolygons • geo-tile x year x census features The Data: US Census
  • 12. • Daily samples, 2012-2016 • 14 weather stations around Portland, sparsely sampled • Temperature, precipitation, wind speed, presence of snow/rain/thunder/sleet… • NOAA weather API • Time x weather features The Data: weather
  • 13. • Major holidays • Political events • time x label The Data: holidays and events
  • 14. How do we make this useful? Data sources: Past Crime, Events, Weather Data, Precincts, Pts of Interest, Census Data, Four Square
  • 16. The process: time series (Past Crime, Events) Regularities over time to predict the future? • Facebook’s Prophet (built in Stan, a probabilistic programming language) • Implemented in Python (scikit-learn-style API) • Additive regression model • Piecewise linear or logistic growth curve trend • Yearly seasonal component modeled using Fourier series • Weekly seasonal component using dummy variables • Dummy-coded holidays and events (impulse regressor)
  • 19. • gstat package: spatial and spatial-temporal geostatistical modeling • sp package: classes and methods for Spatial Data • Implemented in R The process: spatial kriging Can we infer the values of sparsely sampled spatial data? • Gaussian process regression • Interpolation and extrapolation • Modeled using a Gaussian process with empirically estimated covariance • Assumes the correlation between two random variables depends on the spatial distance between them (independent of specific location)
  • 20. The process: spatial kriging variance
  • 21. The process: dimensionality reduction How do we figure out which features are most useful, and not overly correlated? • Dataiku US Census plug-in • PCA in Python • Data: • Look up census block by lat-long • Group number of crimes by year and census block • Feature Selection: Regress and Rank • Correlate each census feature against target (variance across geography) • Rank the features by their significance • Top 5% of regressors for 2013, 2014, 2015 • All features that were significant predictors in all 3 years • Useful features – but maybe still correlated!
  • 22. The process: dimensionality reduction How do we figure out which features are most useful, and not overly correlated? • Dataiku US Census plug-in • PCA in Python • Feature Engineering: Principal Components Analysis Re-describe the data using fewer variables • Find the dimension that explains the most variance in your data • Find the next most explanatory dimension that is orthogonal to the first • Stop when adding more dimensions doesn’t really help • 5 top PCs • Top 10 Regressors
  • 23. How can we put this into a common frame? Diagram: Past Crime + Events → Time Series → Geo x Modeled Crime; Weather Data → Kriging → Time by Weather; Precincts → Geo by Police; Pts of Interest → Geo by Pt Interest; Census Data → Dim Reduct. → Geo by Census; Four Square → Geo by Use
  • 26. Geohashing – combining your data • Geohashing plug-in (Python) • 17,000 hashes (~city block) • PostGIS (PostgreSQL) • ST_Within – points in grids • ST_Accum – aggregate within grid • ST_Union – new shape containing all points
  • 27. Diagram: Past Crime + Events → Time Series → Geo x Modeled Crime; Weather Data → Kriging → Time by Weather; Precincts → Geo by Police; Pts of Interest → Geo by Pt Interest; Census Data → Dim Reduct. → Geo by Census; Four Square → Geo by Use; all outputs: Merge by GeoHash, Index by time
  • 28. Data! • 4.6 million records • Index: Date (weekly) + Geo grid (17K, ~one block) • 61 features • Model crime estimates (+6 weeks future) • Weather features • Police district features • Points of interest • Public use (check-ins) • Census features
  • 29. Modeling Split the data • Training set: random sample of ~1,000,000 records before Aug 01 2015 • Testing set: all records after Aug 01, 2015 (~200,000) Training • ~45 minutes • 3-fold cross validation Models • Random Forest • XGBoost • L1, L2 regression Model comparison • +/- location features • +/- time features
  • 31. Results: Model Performance Random Forest! R² = 0.93 • Split quality criterion: MSE • Number of trees: 100 (with bootstrap) • Max tree depth: 31 • Min samples per leaf: 10 • Min samples to split: 30 Better than • Model with raw spatial and temporal features • Model without spatial features • Model without temporal features
  • 32. Results: Feature Importance (bar chart, importance axis 0% to 16%) • # Shops • PCA 1 • # Foursquare Checkins • Poverty Status of People living alone • # Foursquare Users (Total) • # Travel/Transport • Houses with no vehicle available • # Food Places • # Service Stores • PCA 3 PCA1 • Lived in a different house a year ago • Household speaks limited English • Main means of transportation is not a car • Adults don’t speak English well or at all • Children speak both English and Spanish PCA3 • Household speaks language from Asia or Pacific Islands • % population born outside the US
  • 33. Results: Correlation (figure; axis labels: PCA 1, Poverty Status of People, PCA 3, # Service Stores, # Shops, # Foursquare Checkins, # Food Places, # Foursquare Users, No vehicles, # Travel/Transport, Crimes)
  • 34. Results: Future Crime (March 15-April 1)
  • 35. Diagram: Past Crime + Events → Time Series → Geo x Modeled Crime; Weather Data → Kriging → Time by Weather; Precincts → Geo by Police; Pts of Interest → Geo by Pt Interest; Census Data → Dim Reduct. → Geo by Census; Four Square → Geo by Use; all outputs: Merge by GeoHash, Index by time
  • 36. Data • Open source, high dimensionality, varied across space & time Analyses • Time series, kriging, dimensionality reduction • Geohashing • Machine learning Conclusions • Keep a diverse set of tools in your data science toolbox! • Using targeted analysis methods will improve the model • Poverty, frequent moves, lack of vehicles, lack of English, and lack of commercial infrastructure predict crime rates Crime: a social and technical challenge
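
Slide 26 describes merging every data source onto a shared geohash grid. The slides do not show the plug-in's code, so the following is only a minimal sketch of the standard geohash encoding plus a per-cell count, in plain Python. The function names and the default precision of 7 characters are assumptions chosen for illustration; the talk does not state which precision produced its ~17,000 cells.

```python
from collections import Counter

# Standard geohash alphabet (base 32, omitting a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=7):
    """Encode a latitude/longitude pair as a geohash string.

    Bits alternate between longitude and latitude (longitude first);
    each bit records which half of the current interval holds the point.
    `precision` is the number of base-32 characters (an assumed default).
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0        # bits accumulated for the current character
    value = 0       # integer value of those bits
    use_lon = True  # the first bit always refines longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(_BASE32[value])
            bits = 0
            value = 0
    return "".join(chars)


def count_by_cell(points, precision=7):
    """Count (lat, lon) points per geohash cell (hypothetical helper)."""
    return Counter(geohash_encode(lat, lon, precision) for lat, lon in points)
```

Once each source (calls, points of interest, check-ins) is keyed by the same cell string, the tables can be joined on that key, which is the role the geohash plays in the pipeline diagram.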

Editor's Notes

  1. Domestic crime is a ubiquitous problem: it costs the UK 124 billion+ per year in prevention, intervention, and harm. Despite that, we know surprisingly little. Being able to predict, and thus prevent, crime is incredibly important. Crime causes billions in damages. However, getting a handle on when and where a crime will occur is a serious challenge. Depending on which indicators you look at, crime rates may be increasing or decreasing, and everything from changes in the political environment to the humidity of the air has been cited as a cause of these fluctuations.
  2. Crime presents not only a social and political challenge, but also a technical one. Traditional analyses suffer from a few issues: size and technique. Big data, however, doesn’t have all of the elements that we need.
  3. Portland Oregon, because the NIJ
  4. Semi-variance Variogram, fit with exponential model
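
Note 4 mentions a semi-variance variogram fit with an exponential model. The talk used R's gstat for this; the sketch below is a plain-Python illustration of the two ingredients, not the speaker's code. It assumes one common parameterisation of the exponential model, gamma(h) = nugget + partial_sill * (1 - exp(-h / range_param)), and the function names, parameter names, and binning scheme are all hypothetical. The fitting step itself (choosing the parameters that best match the empirical values) is omitted.

```python
import math
from itertools import combinations


def exponential_variogram(h, nugget, partial_sill, range_param):
    """Exponential variogram model under the assumed parameterisation.

    Returns 0 at distance 0 and rises towards nugget + partial_sill
    as the separation distance h grows.
    """
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / range_param))


def empirical_semivariance(points, bin_width, n_bins):
    """Bin all point pairs by distance and average 0.5 * (v1 - v2)^2.

    `points` is a list of (x, y, value) tuples in planar coordinates.
    Returns (mean_distance, semivariance) for each non-empty bin.
    """
    sums = [0.0] * n_bins
    dists = [0.0] * n_bins
    counts = [0] * n_bins
    for (x1, y1, v1), (x2, y2, v2) in combinations(points, 2):
        h = math.hypot(x1 - x2, y1 - y2)
        b = int(h // bin_width)
        if b < n_bins:
            sums[b] += 0.5 * (v1 - v2) ** 2
            dists[b] += h
            counts[b] += 1
    return [
        (dists[b] / counts[b], sums[b] / counts[b])
        for b in range(n_bins)
        if counts[b]
    ]
```

In a kriging workflow the empirical values would be computed from the weather-station readings, the model curve fitted to them, and the fitted curve then used to weight nearby stations when interpolating to unsampled locations.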