Data Philly Meetup - Big (Geo) Data

Big (Geo) Data Science

Robert Cheetham
cheetham@azavea.com
@rcheetham

Web/Mobile

Geospatial

UI/UX Design

High Performance Computing

R&D

B Corporation
• Projects w/ Social Value
• Summer of Maps
• Pro Bono Program
• Donate share of profits

Research-Driven
• 10% Research Program
• Academic Collaborations
• Open Source

Spatial Temporal Forecasting
with Philadelphia Crime Data

How Phila PD uses Maps

Customized Map Products

Weekly CompStat Meetings

Web Crime Analysis

INCT & PARS – main database sources
over 5,000 incidents daily, over 2 million annually

[Diagram: incident data flow. Labels: Complainant, Verizon, 911, 911 Operator, Radio Dispatcher, CAD, Police Officer, Incident Report Completed by Officer, 48 Desk, INCT, PARS, Daily download, Geocoding Routines, Maps distributed Through Intranet, Printing, CompStat, District X, District Y, District Z]

The Context

1,500,000 people
7,000 police
1,000 civilian employees
2,000,000 new incidents / year
3 crime analysts

What we did

• Weekly Compstat
• Lots of maps
• Automation of map creation
• Web-based systems

… but what if we could…

• Accelerate the cycle
• Proactively notify
• Automate the process

Prototype
[Diagram: prototype architecture. Labels: VB & MapObjects, ArcView, .ini file, Process Documentation, Shapefiles and GRIDs, MS SQL Server Crime Incidents Database]

… but there was a problem …

We needed …

1. Better Statistics

2. Notification

3. Simplicity

Crime Analysis – What has happened?
– Mapping (spatial / temporal densities)
– Trending
– Intelligence Dashboard
Early Warning – What is out of the ordinary?
– Statistical & Threshold-based Hunches (data mining)
– Alerting
Risk Forecasting – What is likely to happen next?
– Near Repeat Pattern
– Load Forecasting

Crime Analysis
– Mapping (spatial / temporal densities)
– Trending
– Intelligence Dashboard
Early Warning
– Statistical & Threshold-based Hunches (data mining)
– Alerting
Risk Forecasting
– Near Repeat Pattern
– Load Forecasting

Early Warning

• Geographic Early Warning System
– A system to alert staff of an unusual situation in a particular
location
– Ingests data sets to automatically “cook on” and only
involves staff when a statistically unusual situation is found

Geostatistical Engine

[Diagram labels: Operational Databases, HunchLab Database, Alerting System]

What is a Hunch?

• A proposed hypothesis, saved into the system, and
continually tested for validity
• Incident Attribute Requirements
– Location (x, y)
– Time (timestamp)
– Classification
• Hunch Attributes
– Location (area)
– Time (recent / historic periods)
– Classification
• Analyses
– Statistical Hunch
– Threshold Hunch
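
A threshold hunch can be sketched in a few lines. This is only an illustration under stated assumptions, not HunchLab's implementation: the class names `Incident` and `ThresholdHunch`, the circular area, and the "count at or above a threshold" rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import hypot


@dataclass
class Incident:
    # The three incident attributes listed on the slide.
    x: float
    y: float
    timestamp: datetime
    classification: str


@dataclass
class ThresholdHunch:
    # Hypothetical hunch: a circle (address & radius), a recent time
    # window, a classification, and a count threshold.
    center_x: float
    center_y: float
    radius: float
    classification: str
    recent_days: int
    threshold: int

    def matches(self, incident: Incident, now: datetime) -> bool:
        distance = hypot(incident.x - self.center_x, incident.y - self.center_y)
        age = now - incident.timestamp
        in_window = timedelta(0) <= age <= timedelta(days=self.recent_days)
        return (
            distance <= self.radius
            and in_window
            and incident.classification == self.classification
        )

    def count_matches(self, incidents, now: datetime) -> int:
        return sum(1 for incident in incidents if self.matches(incident, now))

    def is_triggered(self, incidents, now: datetime) -> bool:
        # Assumed rule: alert when the count reaches the threshold.
        return self.count_matches(incidents, now) >= self.threshold
```

Because the hunch is saved and re-evaluated, the same object can be tested against each new batch of incidents without staff involvement until `is_triggered` returns true.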

Hunch Parameters: Location

• Address & Radius
• Precinct/County/Country
• Custom Drawn Area
• Mass Hunch

Hunch Parameters: Time

• Statistical Hunch
– Recent Past
– Historic Past
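
One plausible way to compare the recent past with the historic past is a z-score against historic counts for periods of the same length. The slides do not say which statistic is used, so the function name `recent_vs_historic_z` and the z-score itself are assumptions for illustration only.

```python
from statistics import mean, stdev


def recent_vs_historic_z(recent_count, historic_counts):
    """Return how many standard deviations the recent count sits
    above the historic mean (assumed test, not from the slides).

    historic_counts holds counts for earlier periods of the same
    length as the recent period and needs at least two values.
    """
    baseline = mean(historic_counts)
    spread = stdev(historic_counts)
    if spread == 0:
        # No historic variation: any increase is unusual.
        return float("inf") if recent_count > baseline else 0.0
    return (recent_count - baseline) / spread


# Example: six historic weeks versus the most recent week.
z = recent_vs_historic_z(9, [4, 5, 6, 5, 4, 6])
is_unusual = z >= 2.0  # the cutoff of 2.0 is an arbitrary example value
```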

Hunch Parameters: Classification

• Category
• Time of Day
• Narrative

Predictive Analytics?

• Prediction vs. Forecasting

Contagious Crime?

• Near repeat pattern analysis
• “If one burglary occurs, how does the risk change nearby?”

What Do We Mean By Near Repeat?

• Repeat victimization
– Incident at the same location at a later time (likely related)
• Near repeat victimization
– Incident at a nearby location at a later time (likely related)

• Incident A (place, time) --> Incident B (place, time)

Near Repeat Pattern Analysis

• The goal:
– Quantify short term risk due to near-repeat victimization
• “If one burglary occurs, how does the risk of burglary for the
neighbors change?”

• What we know:
– Incident A (place, time) --> Incident B (place, time)
• Distance between A and B
• Timeframe between A and B

• What we need to know:
– What distances/timeframes are not simply random?


• The process
– Observe the pattern in historic data
– Simulate the pattern in randomized historic data
– Compare the observed pattern to the simulated patterns
– Apply the non-random pattern to new incidents
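
The observe / simulate / compare steps can be sketched as a permutation test in the spirit of a Knox test. This is a minimal sketch, not the Near Repeat Calculator's code: the function names, the single distance/time band, and the default of 199 simulations are assumptions.

```python
import random
from math import hypot


def count_close_pairs(points, times, max_dist, max_days):
    """Count incident pairs that are close in both space and time."""
    close = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist = hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
            if dist <= max_dist and abs(times[i] - times[j]) <= max_days:
                close += 1
    return close


def near_repeat_test(points, times, max_dist, max_days, simulations=199, seed=0):
    """Compare the observed count with counts from shuffled times.

    Shuffling the times across the fixed locations breaks any link
    between place and time, which gives the "random" baseline.
    """
    rng = random.Random(seed)
    observed = count_close_pairs(points, times, max_dist, max_days)
    shuffled = list(times)
    as_extreme = 0
    for _ in range(simulations):
        rng.shuffle(shuffled)
        if count_close_pairs(points, shuffled, max_dist, max_days) >= observed:
            as_extreme += 1
    # Pseudo p-value: share of runs at least as extreme as observed.
    p_value = (as_extreme + 1) / (simulations + 1)
    return observed, p_value
```

A small p-value for a given distance and timeframe suggests that pairs in that band occur more often than chance would explain, which is the band to apply to new incidents.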

• An example
– 180 days of burglaries in Division 6 of Philadelphia


• How can you test your own data?
– Near Repeat Calculator
• http://www.temple.edu/cj/misc/nr/
• Papers
– Near-Repeat Patterns in Philadelphia Shootings (2008)
• One city block & two weeks after one shooting
– 33% increase in likelihood of a second event

Jerry Ratcliffe
Temple University

Improving CompStat

• Workload forecasting
• “Given the time of year, day of week, time of day and
general trend, what counts of crimes should I expect?”

What Do We Mean By Load Forecasting?

• Workload forecasting
• Generating aggregate crime counts for a future timeframe
using cyclical time series analysis

Measure cyclical patterns

+
Identify non-cyclical trend

Forecast expected count

bit.ly/gorrcrimeforecastingpaper

Load Forecasting

• Measure cyclical patterns
• Take historic incidents (for example: last five years)
• Generate multiplicative seasonal indices
– For each time cycle:
» time of year
» day of week
» time of day
– Count incidents within each time unit (for example: Monday)
– Calculate average per time unit if incidents were evenly
distributed
– Divide counts within each time unit by the calculated average to
generate multiplicative indices
» Index ~ 1 means at the average
» Index > 1 means above average
» Index < 1 means below average
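
The index calculation above can be sketched for the day-of-week cycle. The function name `day_of_week_indices` is hypothetical, and the sketch assumes the history covers whole weeks so that every weekday occurs equally often.

```python
from collections import Counter
from datetime import date, timedelta


def day_of_week_indices(incident_dates):
    """Multiplicative index per weekday (0 = Monday ... 6 = Sunday)."""
    counts = Counter(d.weekday() for d in incident_dates)
    # Average per weekday if incidents were evenly distributed.
    average = len(incident_dates) / 7
    return {day: counts.get(day, 0) / average for day in range(7)}


# Example: two weeks with one incident per day, plus seven extra
# incidents on the first Monday (2024-01-01 is a Monday).
start = date(2024, 1, 1)
history = [start + timedelta(days=i) for i in range(14)] + [start] * 7
indices = day_of_week_indices(history)
```

In this example Monday's index is above 1 and every other weekday's index is below 1. The same pattern applies to time of year and time of day by swapping the grouping key.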

Load Forecasting

• Identify non-cyclical trend
• Take recent daily counts (for example: last year daily counts)
• Remove cyclical trends by dividing by indices

• Run a trending function on the new counts
– Simple average
» Last X Days
– Smoothing function
» Exponential smoothing
» Holt’s linear exponential smoothing
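
A sketch of deseasonalizing and trending follows. The function names, the smoothing constants `alpha` and `beta`, and the simple initialization of Holt's method are assumptions chosen for illustration.

```python
def deseasonalize(daily_counts, weekdays, indices):
    """Divide each count by the seasonal index for its weekday."""
    return [count / indices[day] for count, day in zip(daily_counts, weekdays)]


def exponential_smoothing(values, alpha=0.3):
    """Return the final smoothed level (a flat forecast)."""
    level = values[0]
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def holt_linear(values, alpha=0.3, beta=0.1):
    """Return the final (level, trend); needs at least two values."""
    level = values[0]
    trend = values[1] - values[0]
    for value in values[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend
```

Simple exponential smoothing tracks only a level, while Holt's method also tracks a slope, which is why only the latter can project a linear trend.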

Load Forecasting

• Forecast expected count
• Project trend into future timeframe
– Always flat
» Simple average
» Exponential smoothing
– Linear trend
» Holt’s linear exponential smoothing
• Multiply by seasonal indices to reseasonalize the data
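
Projection and reseasonalizing can be sketched as below. The function name `forecast_counts` and its parameters are hypothetical; a `trend` of zero gives the flat projection, and a non-zero `trend` gives the linear one.

```python
def forecast_counts(level, trend, future_weekdays, indices):
    """Project the deseasonalized level forward, then reseasonalize.

    level, trend:     output of the trending step
    future_weekdays:  weekday (0 = Monday) of each day to forecast
    indices:          multiplicative seasonal index per weekday
    """
    forecasts = []
    for step, weekday in enumerate(future_weekdays, start=1):
        baseline = level + trend * step
        # Counts cannot be negative, so clamp a falling trend at zero.
        forecasts.append(max(baseline, 0.0) * indices[weekday])
    return forecasts


# Example with made-up indices for Monday (0) and Tuesday (1).
flat = forecast_counts(10.0, 0.0, [0, 1], {0: 1.5, 1: 0.5})
rising = forecast_counts(10.0, 1.0, [0, 1], {0: 1.5, 1: 0.5})
```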

Load Forecasting

Measure cyclical patterns

+
Identify non-cyclical trend

Forecast expected count

bit.ly/gorrcrimeforecastingpaper

How Do We Know It’s Accurate?

• Testing
• Generated forecasting techniques (examples)
– Commonly Used
» Average of last 30 days
» Average of last 365 days
» Last year’s count for the same time period
– Advanced Combinations
» Different cyclical indices (example: day of year vs. month of year)
» Different levels of geographic aggregation for indices
» Different trending functions
• Scoring methodologies (examples)
– Mean absolute percent error (with some enhancements)
– Mean percent error
– Mean squared error
• Run thousands of forecasts through testing framework
• Choose the right technique in the right situation
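
The three scoring measures can be sketched as follows. The slides mention "some enhancements" to mean absolute percent error without describing them; skipping periods whose actual count is zero is an assumption made here only to avoid division by zero.

```python
def _nonzero_pairs(actual, forecast):
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("percent errors need at least one non-zero actual")
    return pairs


def mean_absolute_percent_error(actual, forecast):
    pairs = _nonzero_pairs(actual, forecast)
    return 100 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def mean_percent_error(actual, forecast):
    # Signed, so it reveals bias: negative means over-forecasting.
    pairs = _nonzero_pairs(actual, forecast)
    return 100 * sum((a - f) / a for a, f in pairs) / len(pairs)


def mean_squared_error(actual, forecast):
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
```

Running every candidate technique through the same scoring functions on held-out periods is what makes it possible to pick the right technique for each situation.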

Research Topics

• Risk Forecasting
– Load forecasting enhancements
• Weather and special events

– Combining short and long term risk forecasts (Temple)
• Socioeconomic changes in neighborhoods
– Risk Terrain Modeling (Rutgers)
• Context of crime at the microplace

Research Topics

• Risk Forecasting
– Offender Management
• Prioritize offenders based upon statistical models using past
behaviors
• Evaluation
– Automate Randomized Controlled Trials

Data Processing for Big (Geo) Data

Robert’s Rules of Housing
Close to Center City → somewhat important
Walk to Grocery Store → vital
Nearby Restaurants → very important
Library → nice to have
Near a Park → somewhat important
Biking / walking distance from our work → very important
Biking distance to fencing → somewhat important

Your factors might include…
• Child Care
• Local School Rankings
• Farmer's Market
• Car Share
• Public Transit

We stand on the
shoulders of giants

Not a new idea … Design with Nature

Not a new idea … Dana Tomlin

Weighted Overlay

[Diagram: four factor layers weighted x5, x1, x3 and x2, then added together into a single output layer]
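
A weighted overlay is a cell-by-cell weighted sum of aligned raster layers. The sketch below uses plain nested lists and a hypothetical function name `weighted_overlay`; it assumes all layers share the same extent and cell size, and it borrows the weights 5, 1, 3 and 2 from the slide.

```python
def weighted_overlay(layers, weights):
    """Return the per-cell weighted sum of equally sized 2D grids."""
    rows = len(layers[0])
    cols = len(layers[0][0])
    return [
        [
            sum(weight * layer[r][c] for layer, weight in zip(layers, weights))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Example: four tiny 1 x 2 factor layers with scores of 0 or 1.
layers = [[[1, 0]], [[1, 1]], [[0, 1]], [[1, 0]]]
suitability = weighted_overlay(layers, [5, 1, 3, 2])
```

Each cell is computed independently of its neighbors, which is what makes this kind of operation easy to split across workers.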

Summary

Geography-driven Decisions

Iterative

Individual

Web [and Mobile]

Growing data sets

Web is different from the Desktop

• Lots of simultaneous users

• Stateless environment

• HTML+JS+CSS

• Users are less skilled

• Users are less patient

But wait … there’s a problem
• 10 – 60 second calculation time

• Multiple simultaneous users …

• … that are impatient

Specific Optimization Goals
• New Raster File Structure

• Distributed processing

• Binary messaging protocol

Optimization: File Format
• Limit data type and range

• 1D arrays are fast to read/write

• Tiled

• Pyramids

• Azavea Raster Grid (ARG)
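
The first two ideas, a restricted data type and a flat 1D array, can be sketched with the standard library. This is not the actual ARG file layout: the class name `FlatRaster`, the unsigned 8-bit cell type, and the row-major order are assumptions for illustration.

```python
from array import array


class FlatRaster:
    """Raster stored as one flat array of unsigned bytes (0-255)."""

    def __init__(self, cols, rows, fill=0):
        self.cols = cols
        self.rows = rows
        self.cells = array("B", [fill]) * (cols * rows)

    def get(self, col, row):
        # Row-major indexing into the 1D array.
        return self.cells[row * self.cols + col]

    def set(self, col, row, value):
        # Values outside 0-255 raise OverflowError.
        self.cells[row * self.cols + col] = value

    def to_bytes(self):
        # One byte per cell, so the whole grid is a single write.
        return self.cells.tobytes()

    @classmethod
    def from_bytes(cls, cols, rows, data):
        raster = cls(cols, rows)
        raster.cells = array("B", data)
        return raster
```

With a fixed cell size, reading a raster is a single sequential read with no per-cell parsing, and a tile is just a smaller raster of the same kind.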

Optimization: Distributed Processing
• Parallelizable - Local Ops and Focal Ops

• Support multiple
– Threads
– Cores
– CPUs
– Machines

• Considered
– Hadoop
– Amazon Elastic MapReduce
– Beowulf
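
Local operations parallelize well because each output cell depends only on the matching input cell, so a raster can be cut into tiles and each tile handled by a separate worker. The sketch below uses a thread pool from the standard library only to show the structure; the function names and tile size are hypothetical, and in CPython a real speedup for CPU-bound work would need processes or separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tiles(cells, tile_size):
    """Cut a flat list of cells into consecutive tiles."""
    return [cells[i:i + tile_size] for i in range(0, len(cells), tile_size)]


def local_op_parallel(cells, op, tile_size=4, workers=4):
    """Apply a per-cell operation to every tile concurrently."""
    tiles = split_into_tiles(cells, tile_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in tile order, so reassembly is trivial.
        results = list(pool.map(lambda tile: [op(v) for v in tile], tiles))
    return [value for tile in results for value in tile]


# Example: double every cell of a ten-cell raster in tiles of three.
doubled = local_op_parallel(list(range(10)), lambda v: v * 2, tile_size=3)
```

Focal operations follow the same pattern, except that each tile also needs a border of cells from its neighbors.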

Success!!
Reduced from 10-60 seconds to <500 milliseconds

Optimizing one process sub-optimizes others
• Complex to configure and maintain
• Limited to one operation
• No interpolation
• No mixing
– cell sizes
– extents
– projections
• etc.

• Broader set of functionality

• Both raster and vector

• Scala + Akka

• Open source

Regional/State: 84 ms

National: 84 ms

Large Country: 115 ms

Continental: 271 ms

Planet: 1.2 – 2.0 s

GPU Results
• Re-wrote a few Map Algebra operations:
– Local
– Neighborhood
– Zonal
– Viewshed
– etc.
• 15 – 120x
• Large grids
• Large kernels

New Spatial Operations
Vector

Neighborhood/Focal

Spatial Statistics

Integration

Urban Forest Ecosystem Modeling

Crime Analysis, Early Warning and Forecasting

Open Source Geoprocessing

• GDAL

• GeoServer

• PostGIS

• R

• GeoDa

Big (Geo) Data Science

[We are hiring]

Robert Cheetham
cheetham@azavea.com
@rcheetham
