According to data compiled by the National Highway Traffic Safety Administration, in 2016 an average of roughly 100 people were killed in automobile accidents every day in the United States. Agero, a market leader in software-enabled driver assistance services, has responded to this growing problem with a breakthrough consumer app that provides near real-time driver behavior analysis and actionable insights that help its users become safer drivers.
As part of this effort, we have developed a methodology to identify the most frequent routes that each driver travels by applying Dynamic Time Warping time-series analysis techniques to spatial data. In this talk, we will give a high-level overview of the methodology, and discuss the performance improvement achieved by transitioning the software from stand-alone Python into PySpark + Databricks.
Discussion points will include how to determine the best way to (re)design Python functions to run in Spark, the development and use of user-defined functions in PySpark, how to integrate Spark data frames and functions into Python code, and how to use PySpark to perform ETL from AWS on very large datasets.
Automobile Route Matching with Dynamic Time Warping Using PySpark with Catherine Slesnick and Scott Frye
1. Automobile Route Matching with Dynamic Time Warping Using PySpark
Catherine Slesnick, Agero @AgeroNews
Scott Frye, Agero @scott_frye
#DD3SAIS
2. Who Is Agero?
Agero is a mission-driven organization obsessed with making driving safer. Its award-winning services combine innovative technologies with human-powered solutions to safeguard the driving experience, and ultimately save lives.
4. Who Is Agero?
• 1 in 3 licensed drivers
• Services cover 80M consumers
• 10M+ annual events
• 7 / 10 top insurance companies
• 1M+ accident recoveries per year
• Connecting drivers to over 14,000 dealerships
• Connecting drivers to over 74,000 repair shops
• Operating 24 / 7 / 365
• 6 emergency-response-enabled locations
5. Data Science at Agero
• 12 full-time people
• Processed more than 3 PB of telematics data
• Python, Spark, AWS
• Span a wide range of algorithms
We build algorithms to help Agero make the roads a safer place to drive:
• Crash detection & prevention
• Bringing services to stranded drivers
6. MileUp™ − Agero’s Mobile Crash Detection and Crash Prevention App
MileUp crowdsources 100% natural driving and crash data from everyday drivers.
Official launch at #DD3SAIS
https://youtu.be/RyrFqd0jeKo
7. The Initial MileUp Went Viral and Reached #10 Most Popular App
On December 14th, 2016, Agero posted “Get Gift Cards for Normal Driving” on the Beer Money Reddit thread. The initial post received over 240 comments and 180 upvotes within the first day.
[Chart: Lifestyle App Ranking (iOS), log scale from 1 to 10,000; MileUp reached the Top 10 in the Apple Lifestyle category]
8. MileUp Beta by the Numbers
• 300K+ active users
• 11K+ iOS accidents detected
• 450+ severe accidents verified
• 200M trips captured
• 2 billion miles driven
10. How MileUp Works
• Ingest sensor data from smartphones
• Process the data using machine learning
• Detect accidents in real time
• Detect driving behavior patterns
12. Analyze Route Familiarity Driving Patterns
• People become familiar with routes they drive often
• Daily commute
• Home to grocery store / school
• Route familiarity affects crash risk
• Most accidents occur on familiar roads
• Drivers are more likely to be distracted
• Increased speeding
• More aggressive cornering
• Accidents on familiar roads tend to be less severe
13. Analyze Route Familiarity Driving Patterns
Comparing a user’s trips is challenging:
• Comparing location data point-by-point is very expensive
• Velocities will differ between every trip
• We want to identify similar routes in addition to same routes
We have A LOT of data for each user.
14. Reduce Amount of Data to Process by Looking at Trip Endpoints
Constraints:
• Process all of a user’s data together
• Work in Python / PySpark
[Diagram: User trip data (RDDs) → Python function extracts trip endpoints and compares midpoints and distances → matched trip pair candidates, e.g. (A, B), (A, C), (A, E), (D, F), (G, H), …]
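This stage can be sketched in a few lines of PySpark. Everything below is a hypothetical reconstruction: the RDD name `trip_rdd`, the haversine helper, and the 0.5 km threshold are assumptions, not the talk's code; the slide only specifies grouping each user's trips and comparing endpoints, midpoints, and distances.

```python
# Hypothetical sketch of the endpoint pre-filter (names and threshold assumed).
from itertools import combinations
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def candidate_pairs(trips, threshold_km=0.5):
    """Yield trip-id pairs whose start and end points are both close.
    trips: iterable of (trip_id, start_point, end_point)."""
    for (id_a, s_a, e_a), (id_b, s_b, e_b) in combinations(list(trips), 2):
        if haversine_km(s_a, s_b) < threshold_km and haversine_km(e_a, e_b) < threshold_km:
            yield (id_a, id_b)

# trip_rdd: RDD of (user_id, (trip_id, start_point, end_point))
candidates = (trip_rdd
              .groupByKey()                     # all of a user's trips together
              .flatMapValues(candidate_pairs))  # -> (user_id, (trip_a, trip_b))
```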
15. Reduce Amount of Data to Process by Looking at Trip Endpoints
Processing results for part 1 of analysis:
• Ran on 2 weeks of data
• ~50k driver sample
• 112 cores
→ Took ~1 hr
→ Produced ~400k trip pairs
Worked sometimes, but not all of the time → need more sophisticated analysis to refine results.
16. Use Dynamic Time Warping (DTW)¹ to Refine Trip Pairs
DTW is an algorithm for measuring the similarity between 2 temporal sequences that may vary in speed.
Any distance (Euclidean, Manhattan, …) which aligns the i-th point on one time series with the i-th point on the other will produce a poor similarity score.
[Figure: two time series with lock-step alignment of the i-th point to the i-th point]
¹ The majority of the contents of slides 16-18 are borrowed from presentations by:
• Tim Oates: workshop slides from Boston Big Data Tech Con 2015
• Elena Tsiporkova: http://www.psb.ugent.be/cbd/papers/gentxwarper/DTWAlgorithm.ppt
17. Use Dynamic Time Warping (DTW) to Refine Trip Pairs
DTW is an algorithm for measuring the similarity between 2 temporal sequences that may vary in speed.
A non-linear (elastic) alignment produces a more intuitive similarity measure, allowing similar shapes to match even if they are out of phase in the time axis.
[Figure: two time series with elastic alignment matching the i-th point to the (i+2)-th point]
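As a toy illustration of the point these two slides make (our own example, not from the talk): two series with the same shape but a phase shift score poorly under lock-step alignment, which is exactly the case DTW's elastic alignment handles.

```python
# Two series with identical shape, shifted along the time axis.
import numpy as np

t = np.linspace(0, 2 * np.pi, 100)
a = np.sin(t)
b = np.sin(t - 0.5)  # same shape, out of phase

# Lock-step Euclidean distance: the i-th point is forced to match the i-th point.
print(np.sqrt(np.sum((a - b) ** 2)))  # large despite near-identical shapes
```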
18. Use Dynamic Time Warping to Refine Trip Pairs
To find the best alignment between A and B, one needs to find the path through the grid
P = p_1, …, p_s, …, p_k, where p_s = (i_s, j_s),
which minimizes the total distance between them. P is called a warping function.
DTW is computationally expensive!
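To make the warping-function definition concrete, here is a textbook O(n·m) dynamic-programming DTW (a generic sketch, not the talk's code). The quadratic cost grid it fills is why the slide calls DTW computationally expensive.

```python
# Classic DTW: fill a cumulative-cost grid; the optimal warping path
# P = p_1, ..., p_k through the grid minimizes the total distance.
import numpy as np

def dtw_distance(x, y, dist=lambda u, v: abs(u - v)):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            # A path step arrives from the left, below, or diagonal cell.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```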
We used FastDTW:
• A multilevel approach that recursively uses sampling and a space constraint to compute the warping function
• Stan Salvador and Philip Chan: http://cs.fit.edu/~pkc/papers/tdm04.pdf
• Open-source Python implementation available: https://github.com/slaypni/fastdtw
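The cited package is simple to call. The GPS traces below are made up, and treating raw (lat, lon) pairs as Euclidean points is a simplification for illustration.

```python
# FastDTW on two point sequences; radius trades accuracy for speed.
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

trip_a = [(42.36, -71.06), (42.37, -71.07), (42.38, -71.08)]
trip_b = [(42.36, -71.06), (42.38, -71.08)]

distance, path = fastdtw(trip_a, trip_b, radius=1, dist=euclidean)
print(distance)  # similarity score: lower means more similar routes
print(path)      # the warping path as (i, j) index pairs
```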
19. Algorithm Parallelization/Optimization Needed
• ~400k trip pairs to check
• Average driver commute in the U.S. is ~25 minutes → ~1,500 data points per trip
• 1 DTW comparison took ~5.3 seconds → ~600 hours of compute time
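A quick check of those numbers (the ~1 Hz sampling rate is our assumption; the slide gives only the trip length and point count):

```python
points_per_trip = 25 * 60                 # 25-minute commute at ~1 Hz -> 1,500 points
pairs = 400_000                           # candidate pairs from part 1
sec_per_comparison = 5.3                  # one DTW comparison
print(pairs * sec_per_comparison / 3600)  # ~589 hours, i.e. ~600 hours serial
```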
First tried to add DTW onto the previous per-driver function:
• That was how I was thinking about the analysis
• That was how I had written the POC in Python
21. Algorithm Parallelization/Optimization Needed
Solution:
1. Turned candidate trip pair results into a Spark data frame (DF)
2. Created a new user-defined function (UDF) to perform DTW on each row of the data frame
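A hedged sketch of that solution. The column names, toy rows, and the choice to carry each trip's point sequences in the row are assumptions; the talk only states that a UDF runs DTW on each row of the data frame.

```python
# Sketch: score each candidate pair with FastDTW via a PySpark UDF.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

spark = SparkSession.builder.getOrCreate()

# Each row holds one candidate pair plus both trips' (lat, lon) sequences.
pairs_df = spark.createDataFrame(
    [("A", "B",
      [[42.36, -71.06], [42.37, -71.07], [42.38, -71.08]],
      [[42.36, -71.06], [42.38, -71.08]])],
    ["trip_a", "trip_b", "points_a", "points_b"])

@udf(returnType=DoubleType())
def dtw_score(points_a, points_b):
    distance, _ = fastdtw(points_a, points_b, dist=euclidean)
    return float(distance)

# Spark distributes the expensive comparisons across executors.
scored = pairs_df.withColumn("dtw", dtw_score("points_a", "points_b"))
```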
22. What We Learned
• Detecting a driver’s familiar routes is possible even with LOTS of data
• Transitioning Python algorithms into PySpark can require a shift in thinking about how to structure the code (and some trial & error)
• When working in PySpark, you can use RDDs and DFs together to parallelize different parts of the analysis (see the minimal example below)
Processed ~600 million points in 5 hours on 112 cores
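As a minimal illustration of that last lesson (names carried over from the earlier hypothetical endpoint sketch): the RDD output of the endpoint stage can become a DataFrame for the UDF stage.

```python
# Bridge the RDD stage to the DataFrame/UDF stage (column names assumed).
pairs_df = (candidates
            .map(lambda kv: (kv[0], kv[1][0], kv[1][1]))
            .toDF(["user_id", "trip_a", "trip_b"]))
```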