According to data compiled by the National Highway Traffic Safety Administration, in 2016 an average of roughly 100 people were killed in automobile accidents every day in the United States. Agero, a market leader in software-enabled driver assistance services, has responded to this growing problem with a breakthrough consumer app that provides near real-time driver behavior analysis and actionable insights that help its users become safer drivers.
As part of this effort, we have developed a methodology to identify the most frequent routes that each driver travels by applying Dynamic Time Warping time-series analysis techniques to spatial data. In this talk, we will give a high-level overview of the methodology, and discuss the performance improvement achieved by transitioning the software from stand-alone Python into PySpark + Databricks.
Discussion points will include how to determine the best way to (re)design Python functions to run in Spark, the development and use of user-defined functions in PySpark, how to integrate Spark data frames and functions into Python code, and how to use PySpark to perform ETL from AWS on very large datasets.
Automobile Route Matching with Dynamic Time Warping Using PySpark with Catherine Slesnick and Scott Frye
1. Automobile Route Matching with Dynamic Time Warping Using PySpark
Catherine Slesnick, Agero @AgeroNews
Scott Frye, Agero @scott_frye
#DD3SAIS
2. Who Is Agero?
Agero is a mission-driven organization obsessed with making driving safer. Its award-winning services combine innovative technologies with human-powered solutions to safeguard the driving experience, and ultimately save lives.
4. Who Is Agero?
• 1 in 3 licensed drivers
• Services cover 80M consumers
• 10M+ annual events
• 7 / 10 top insurance companies
• 1M+ accident recoveries per year
• Connecting drivers to over 14,000 dealerships
• Connecting drivers to over 74,000 repair shops
• Operating 24 / 7 / 365
• 6 emergency-response-enabled locations
5. Data Science at Agero
• 12 full-time people
• Processed more than 3 PB of telematics data
• Python, Spark, AWS
• Span a wide range of algorithms
We build algorithms to help Agero make the roads a safer place to drive:
• Crash detection & prevention
• Bringing services to stranded drivers
6. MileUp™ − Agero’s Mobile Crash Detection and Crash Prevention App
MileUp crowdsources 100% natural driving and crash data from everyday drivers.
Official launch at #DD3SAIS
https://youtu.be/RyrFqd0jeKo
7. The Initial MileUp Went Viral and Reached #10 Most Popular App
On December 14th, 2016, Agero posted “Get Gift Cards for Normal Driving” on the Beer Money Reddit thread. The initial post received over 240 comments and 180 upvotes within the first day.
[Chart: Lifestyle App Ranking (iOS), log scale from 1 to 10,000; MileUp reached the Top 10 in the Apple Lifestyle category]
8. MileUp Beta by the Numbers
• 300K+ active users
• 11K+ iOS accidents detected
• 450+ severe accidents verified
• 200M trips captured
• 2 billion miles driven
10. How MileUp Works
• Ingest sensor data from smartphones
• Process the data using machine learning
• Detect accidents in real time
• Detect driving behavior patterns
12. Analyze Route Familiarity Driving Patterns
• People become familiar with routes they drive often
• Daily commute
• Home to grocery store / school
• Route familiarity affects crash risk
• Most accidents occur on familiar roads
• Drivers are more likely to be distracted
• Increased speeding
• More aggressive cornering
• Accidents on familiar roads tend to be less severe
13. Analyze Route Familiarity Driving Patterns
Comparing a user’s trips is challenging:
• Comparing location data point-by-point is very expensive
• Velocities will differ between every trip
• We want to identify similar routes in addition to same routes
We have A LOT of data for each user.
14. Reduce Amount of Data to Process by Looking at Trip Endpoints
Constraints:
• Process all of a user’s data together
• Work in Python / PySpark
[Diagram: User trip data (RDDs) → Python function extracts trip endpoints and compares midpoints and distances → matched trip pair candidates, e.g. (A, B), (A, C), (A, E), (D, F), (G, H), …]
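This stage can be sketched in a few lines of PySpark. Everything below is a hypothetical reconstruction: the RDD name `trip_rdd`, the haversine helper, and the 0.5 km threshold are assumptions, not the talk's code; the slide only specifies grouping each user's trips and comparing endpoints, midpoints, and distances.

```python
# Hypothetical sketch of the endpoint pre-filter (names and threshold assumed).
from itertools import combinations
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def candidate_pairs(trips, threshold_km=0.5):
    """Yield trip-id pairs whose start and end points are both close.
    trips: iterable of (trip_id, start_point, end_point)."""
    for (id_a, s_a, e_a), (id_b, s_b, e_b) in combinations(list(trips), 2):
        if haversine_km(s_a, s_b) < threshold_km and haversine_km(e_a, e_b) < threshold_km:
            yield (id_a, id_b)

# trip_rdd: RDD of (user_id, (trip_id, start_point, end_point))
candidates = (trip_rdd
              .groupByKey()                     # all of a user's trips together
              .flatMapValues(candidate_pairs))  # -> (user_id, (trip_a, trip_b))
```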
15. Reduce Amount of Data to Process by Looking at Trip Endpoints
Processing results for part 1 of analysis:
• Ran on 2 weeks of data
• ~50k driver sample
• 112 cores
→ Took ~1 hr
→ Produced ~400k trip pairs
Worked sometimes, but not all of the time → need more sophisticated analysis to refine results.
16. Use Dynamic Time Warping (DTW)¹ to Refine Trip Pairs
DTW is an algorithm for measuring the similarity between 2 temporal sequences that may vary in speed.
Any distance (Euclidean, Manhattan, …) which aligns the i-th point on one time series with the i-th point on the other will produce a poor similarity score.
[Figure: two time series with lock-step alignment of the i-th point to the i-th point]
¹ The majority of the contents of slides 16-18 are borrowed from presentations by:
• Tim Oates: workshop slides from Boston Big Data Tech Con 2015
• Elena Tsiporkova: http://www.psb.ugent.be/cbd/papers/gentxwarper/DTWAlgorithm.ppt
17. Use Dynamic Time Warping (DTW) to Refine Trip Pairs
DTW is an algorithm for measuring the similarity between 2 temporal sequences that may vary in speed.
A non-linear (elastic) alignment produces a more intuitive similarity measure, allowing similar shapes to match even if they are out of phase in the time axis.
[Figure: two time series with elastic alignment matching the i-th point to the (i+2)-th point]
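As a toy illustration of the point these two slides make (our own example, not from the talk): two series with the same shape but a phase shift score poorly under lock-step alignment, which is exactly the case DTW's elastic alignment handles.

```python
# Two series with identical shape, shifted along the time axis.
import numpy as np

t = np.linspace(0, 2 * np.pi, 100)
a = np.sin(t)
b = np.sin(t - 0.5)  # same shape, out of phase

# Lock-step Euclidean distance: the i-th point is forced to match the i-th point.
print(np.sqrt(np.sum((a - b) ** 2)))  # large despite near-identical shapes
```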
18. Use Dynamic Time Warping to Refine Trip Pairs
To find the best alignment between A and B, one needs to find the path through the grid
P = p_1, …, p_s, …, p_k, where p_s = (i_s, j_s),
which minimizes the total distance between them. P is called a warping function.
DTW is computationally expensive!
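To make the warping-function definition concrete, here is a textbook O(n·m) dynamic-programming DTW (a generic sketch, not the talk's code). The quadratic cost grid it fills is why the slide calls DTW computationally expensive.

```python
# Classic DTW: fill a cumulative-cost grid; the optimal warping path
# P = p_1, ..., p_k through the grid minimizes the total distance.
import numpy as np

def dtw_distance(x, y, dist=lambda u, v: abs(u - v)):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            # A path step arrives from the left, below, or diagonal cell.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```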
We used FastDTW:
• A multilevel approach that recursively uses sampling and a space constraint to compute the warping function
• Stan Salvador and Philip Chan: http://cs.fit.edu/~pkc/papers/tdm04.pdf
• Open-source Python implementation available: https://github.com/slaypni/fastdtw
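The cited package is simple to call. The GPS traces below are made up, and treating raw (lat, lon) pairs as Euclidean points is a simplification for illustration.

```python
# FastDTW on two point sequences; radius trades accuracy for speed.
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

trip_a = [(42.36, -71.06), (42.37, -71.07), (42.38, -71.08)]
trip_b = [(42.36, -71.06), (42.38, -71.08)]

distance, path = fastdtw(trip_a, trip_b, radius=1, dist=euclidean)
print(distance)  # similarity score: lower means more similar routes
print(path)      # the warping path as (i, j) index pairs
```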
19. Algorithm Parallelization/Optimization Needed
• ~400k trip pairs to check
• Average driver commute in the U.S. is ~25 minutes → ~1,500 data points per trip
• 1 DTW comparison took ~5.3 seconds → ~600 hours of compute time
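A quick check of those numbers (the ~1 Hz sampling rate is our assumption; the slide gives only the trip length and point count):

```python
points_per_trip = 25 * 60                 # 25-minute commute at ~1 Hz -> 1,500 points
pairs = 400_000                           # candidate pairs from part 1
sec_per_comparison = 5.3                  # one DTW comparison
print(pairs * sec_per_comparison / 3600)  # ~589 hours, i.e. ~600 hours serial
```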
First tried to add DTW onto the previous per-driver function:
• That was how I was thinking about the analysis
• That was how I had written the POC in Python
21. Algorithm Parallelization/Optimization Needed
Solution:
1. Turned candidate trip pair results into a Spark data frame (DF)
2. Created a new user-defined function (UDF) to perform DTW on each row of the data frame
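A hedged sketch of that solution. The column names, toy rows, and the choice to carry each trip's point sequences in the row are assumptions; the talk only states that a UDF runs DTW on each row of the data frame.

```python
# Sketch: score each candidate pair with FastDTW via a PySpark UDF.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

spark = SparkSession.builder.getOrCreate()

# Each row holds one candidate pair plus both trips' (lat, lon) sequences.
pairs_df = spark.createDataFrame(
    [("A", "B",
      [[42.36, -71.06], [42.37, -71.07], [42.38, -71.08]],
      [[42.36, -71.06], [42.38, -71.08]])],
    ["trip_a", "trip_b", "points_a", "points_b"])

@udf(returnType=DoubleType())
def dtw_score(points_a, points_b):
    distance, _ = fastdtw(points_a, points_b, dist=euclidean)
    return float(distance)

# Spark distributes the expensive comparisons across executors.
scored = pairs_df.withColumn("dtw", dtw_score("points_a", "points_b"))
```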
22. What We Learned
• Detecting a driver’s familiar routes is possible even with LOTS of data
• Transitioning Python algorithms into PySpark can require a shift in thinking about how to structure the code (and some trial & error)
• When working in PySpark, you can use RDDs and DFs together to parallelize different parts of the analysis (see the minimal example below)
Processed ~600 million points in 5 hours on 112 cores
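As a minimal illustration of that last lesson (names carried over from the earlier hypothetical endpoint sketch): the RDD output of the endpoint stage can become a DataFrame for the UDF stage.

```python
# Bridge the RDD stage to the DataFrame/UDF stage (column names assumed).
pairs_df = (candidates
            .map(lambda kv: (kv[0], kv[1][0], kv[1][1]))
            .toDF(["user_id", "trip_a", "trip_b"]))
```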