Azavea develops software to analyze large geospatial datasets for social good. Their Big (Geo) Data Science work includes spatial forecasting of crime patterns in Philadelphia using statistical analysis of past incidents. They created automated maps and alerts for the police department to help predict crime hotspots and accelerate response times. Azavea also conducts open source research on high performance geoprocessing techniques to enable analysis of massive global datasets within seconds.
3. B Corporation
• Projects w/ Social Value
• Summer of Maps
• Pro Bono Program
• Donate share of profits
Research-Driven
• 10% Research Program
• Academic Collaborations
• Open Source
5. How Phila PD uses Maps
Customized Map Products
Weekly CompStat Meetings
Web Crime Analysis
6. INCT & PARS – main database sources
over 5,000 incidents daily, over 2 million annually
[Data flow diagram. Labels: Complainant, Verizon, 911 Operator, Radio Dispatcher, CAD, Police Officer, Incident Report (completed by officer), 48 Desk, INCT, PARS, Daily download, District & Geocoding Routines, District X, District Y, District Z, Maps distributed through Intranet, Printing, CompStat]
17. Crime Analysis – What has happened?
– Mapping (spatial / temporal densities)
– Trending
– Intelligence Dashboard
Early Warning – What is out of the ordinary?
– Statistical & Threshold-based Hunches (data mining)
– Alerting
Risk Forecasting – What is likely to happen next?
– Near Repeat Pattern
– Load Forecasting
23. Early Warning
• Geographic Early Warning System
– A system to alert staff of an unusual situation in a particular
location
– Ingests data sets, “cooks on” them automatically, and only
involves staff when a statistically unusual situation is found
[Architecture diagram. Labels: Operational Database (three instances), Geostatistical Engine, HunchLab, Alerting System, Databases]
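The idea of only involving staff when a count is statistically unusual can be sketched as a simple threshold check. This is a minimal illustration, not the HunchLab engine: the z-score rule, the function name `is_unusual` and the default threshold of 3 standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev

def is_unusual(history, current, z_threshold=3.0):
    """Return True when `current` is far from the historic counts.

    history: past counts for one location and time unit (at least two values).
    z_threshold: hypothetical cut-off, in standard deviations.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No historic variation at all: any different count is unusual.
        return current != mu
    return abs(current - mu) / sigma >= z_threshold

# Example: weekly burglary counts for one (made-up) area.
weekly_counts = [10, 12, 11, 9, 10, 11, 12, 10]
alert_on_spike = is_unusual(weekly_counts, 20)
alert_on_normal_week = is_unusual(weekly_counts, 11)
```

A production system would more likely use a count-based model and a rule per crime type; the point here is only that staff are notified when the check returns True.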
35. Contagious Crime?
• Near repeat pattern analysis
• “If one burglary occurs, how does the risk change nearby?”
36. What Do We Mean By Near Repeat?
• Repeat victimization
– Incident at the same location at a later time (likely related)
• Near repeat victimization
– Incident at a nearby location at a later time (likely related)
• Incident A (place, time) --> Incident B (place, time)
37. Near Repeat Pattern Analysis
• The goal:
– Quantify short term risk due to near-repeat victimization
• “If one burglary occurs, how does the risk of burglary for the
neighbors change?”
• What we know:
– Incident A (place, time) --> Incident B (place, time)
• Distance between A and B
• Timeframe between A and B
• What we need to know:
– What distances/timeframes are not simply random?
38. Near Repeat Pattern Analysis
• The process
– Observe the pattern in historic data
– Simulate the pattern in randomized historic data
– Compare the observed pattern to the simulated patterns
– Apply the non-random pattern to new incidents
• An example
– 180 days of burglaries in Division 6 of Philadelphia
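The observe, simulate and compare steps can be sketched as a small Monte Carlo test in the style of a Knox test. This is a rough sketch under stated assumptions, not the Near Repeat Calculator's implementation: the function names, the single distance/time band, straight-line distance in projected coordinates and the default of 199 runs are all choices made for illustration.

```python
import random
from itertools import combinations
from math import hypot

def count_close_pairs(points, days, max_dist, max_days):
    """Count incident pairs that are close in both space and time."""
    n = 0
    for i, j in combinations(range(len(points)), 2):
        dist = hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
        if dist <= max_dist and abs(days[i] - days[j]) <= max_days:
            n += 1
    return n

def near_repeat_test(points, days, max_dist, max_days, runs=199, seed=0):
    """Compare the observed pair count with counts from shuffled dates.

    Shuffling the dates keeps every location and every date but breaks
    any link between them, which is the "randomized historic data".
    Returns (observed count, observed / mean simulated, pseudo p-value).
    """
    rng = random.Random(seed)
    observed = count_close_pairs(points, days, max_dist, max_days)
    shuffled = list(days)
    simulated = []
    for _ in range(runs):
        rng.shuffle(shuffled)
        simulated.append(count_close_pairs(points, shuffled, max_dist, max_days))
    at_least_as_large = sum(1 for s in simulated if s >= observed)
    p_value = (at_least_as_large + 1) / (runs + 1)
    expected = sum(simulated) / runs
    ratio = observed / expected if expected else float("inf")
    return observed, ratio, p_value
```

A ratio well above 1 with a small p-value suggests that pairs inside that distance and time band occur more often than chance would explain. Repeating the test over several bands gives the distances and timeframes that are not simply random.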
43. Near Repeat Pattern Analysis
• How can you test your own data?
– Near Repeat Calculator
• http://www.temple.edu/cj/misc/nr/
• Papers
– Near-Repeat Patterns in Philadelphia Shootings (2008)
• Within one city block and two weeks of a shooting
– 33% increase in likelihood of a second event
Jerry Ratcliffe
Temple University
46. Improving CompStat
• Workload forecasting
• “Given the time of year, day of week, time of day and
general trend, what counts of crimes should I expect?”
47. What Do We Mean By Load Forecasting?
• Workload forecasting
• Generating aggregate crime counts for a future timeframe
using cyclical time series analysis
Measure cyclical patterns + Identify non-cyclical trend → Forecast expected count
bit.ly/gorrcrimeforecastingpaper
48. Load Forecasting
• Measure cyclical patterns
• Take historic incidents (for example: last five years)
• Generate multiplicative seasonal indices
– For each time cycle:
» time of year
» day of week
» time of day
– Count incidents within each time unit (for example: Monday)
– Calculate average per time unit if incidents were evenly
distributed
– Divide counts within each time unit by the calculated average to
generate multiplicative indices
» Index ~ 1 means at the average
» Index > 1 means above average
» Index < 1 means below average
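The index calculation above can be sketched for a single cycle (day of week). The function names are hypothetical, and the sketch assumes the historic window contains each weekday equally often; otherwise the "evenly distributed" average would need to account for how many times each unit occurs.

```python
from collections import Counter
from datetime import date, timedelta

def seasonal_indices(incident_dates, unit_of, n_units):
    """Multiplicative index per time unit: unit count / evenly spread average."""
    if not incident_dates:
        raise ValueError("need at least one incident")
    counts = Counter(unit_of(d) for d in incident_dates)
    average = len(incident_dates) / n_units
    return {u: counts.get(u, 0) / average for u in range(n_units)}

def day_of_week_indices(incident_dates):
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return seasonal_indices(incident_dates, lambda d: d.weekday(), 7)

# Example: four weeks with one incident per day, plus one extra every Monday.
start = date(2024, 1, 1)  # a Monday
dates = [start + timedelta(days=i) for i in range(28)]
dates += [start + timedelta(weeks=w) for w in range(4)]
indices = day_of_week_indices(dates)
```

In this made-up data Monday gets an index of 1.75 and every other day 0.875, so the indices average to 1 across the week. The same helper could be reused for time of year or time of day by passing a different `unit_of` function.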
53. Load Forecasting
• Identify non-cyclical trend
• Take recent daily counts (for example: last year daily counts)
• Remove cyclical patterns by dividing the counts by the seasonal indices
• Run a trending function on the new counts
– Simple average
» Last X Days
– Smoothing function
» Exponential smoothing
» Holt’s linear exponential smoothing
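The deseasonalizing step and the three trending functions can be sketched as follows. These are textbook forms, not the presenters' code; the function names, the smoothing parameters `alpha` and `beta`, and the way Holt's method is initialized from the first two values are assumptions for the example.

```python
def deseasonalize(daily_counts, indices):
    """daily_counts: list of (weekday, count); indices: weekday -> seasonal index."""
    return [count / indices[weekday] for weekday, count in daily_counts]

def simple_average(values, last_n):
    """Average of the last `last_n` values (or of all values if fewer exist)."""
    window = values[-last_n:]
    return sum(window) / len(window)

def exponential_smoothing(values, alpha=0.3):
    """Return the final smoothed level; recent values weigh more."""
    level = values[0]
    for v in values[1:]:
        level = alpha * v + (1 - alpha) * level
    return level

def holt_linear(values, alpha=0.3, beta=0.1):
    """Holt's linear exponential smoothing; returns (level, trend per step)."""
    level, trend = values[0], values[1] - values[0]
    for v in values[1:]:
        previous_level = level
        level = alpha * v + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend
```

The simple average and plain exponential smoothing produce only a level, while Holt's method also tracks a slope.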
54. Load Forecasting
• Forecast expected count
• Project trend into future timeframe
– Always flat
» Simple average
» Exponential smoothing
– Linear trend
» Holt’s linear exponential smoothing
• Multiply by seasonal indices to reseasonalize the data
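Projecting the trend and reseasonalizing can be sketched in a few lines. The function name and argument layout are hypothetical; a flat projection is simply the case where the trend is zero, and if several cycles are in use the caller is assumed to have already combined their indices for each future time unit.

```python
def forecast(level, trend, future_indices):
    """Project the deseasonalized trend forward and reseasonalize it.

    level, trend: output of a trending function (trend = 0 for flat methods).
    future_indices: seasonal index for each future time unit, in order.
    """
    return [
        (level + trend * step) * index
        for step, index in enumerate(future_indices, start=1)
    ]

# Flat projection (simple average or exponential smoothing).
flat = forecast(10, 0, [1.5, 0.5])
# Linear projection (Holt's linear exponential smoothing).
linear = forecast(10, 2, [1.0, 0.5])
```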
57. How Do We Know It’s Accurate?
• Testing
• Generated forecasting techniques (examples)
– Commonly Used
» Average of last 30 days
» Average of last 365 days
» Last year’s count for the same time period
– Advanced Combinations
» Different cyclical indices (example: day of year vs. month of year)
» Different levels of geographic aggregation for indices
» Different trending functions
• Scoring methodologies (examples)
– Mean absolute percent error (with some enhancements)
– Mean percent error
– Mean squared error
• Run thousands of forecasts through testing framework
• Choose the right technique in the right situation
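The scoring step can be sketched with the plain textbook versions of the three measures. The "enhancements" mentioned above are not specified, so none are reproduced here; the function names, the choice to skip periods with an actual count of zero in the percent-based measures, and the `best_technique` helper are assumptions for illustration.

```python
def mean_absolute_percent_error(actual, predicted):
    """MAPE in percent; periods with an actual count of 0 are skipped."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)

def mean_percent_error(actual, predicted):
    """MPE in percent; the sign shows bias (negative = over-forecasting)."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100 * sum((a - p) / a for a, p in pairs) / len(pairs)

def mean_squared_error(actual, predicted):
    """MSE; penalizes large misses more heavily than small ones."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def best_technique(actual, forecasts, score=mean_absolute_percent_error):
    """forecasts: technique name -> predicted counts. Lowest score wins."""
    return min(forecasts, key=lambda name: score(actual, forecasts[name]))
```

Running `best_technique` separately per crime type, geography and time horizon is one way to pick the right technique for each situation.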
59. Research Topics
• Risk Forecasting
– Load forecasting enhancements
• Weather and special events
– Combining short and long term risk forecasts (Temple)
• Socioeconomic changes in neighborhoods
– Risk Terrain Modeling (Rutgers)
• Context of crime at the microplace
64. Robert’s Rules of Housing
Close to Center City somewhat important
Walk to Grocery Store vital
Nearby Restaurants very important
Library nice to have
Near a Park somewhat important
Biking / walking distance from our work very important
Biking distance to fencing somewhat important
65. Your factors might include…
Child Care
Local School Rankings
Farmer's Market
Car Share
Public Transit
73. Web is different from the Desktop
Lots of simultaneous users
Stateless environment
HTML+JS+CSS
Users are less skilled
Users are less patient
74. But wait … there’s a problem
10 – 60 second calculation time
Multiple simultaneous users …
… that are impatient
86. Optimizing one process sub-optimizes others
Complex to configure and maintain
Limited to one operation
No interpolation
No mixing
– cell sizes
– extents
– projections
etc.
88. Broader set of functionality
Both raster and vector
Scala + Akka
Open source