CitySprint Fleetmapper use case - Big Data Bootcamp
1. Eduard Lazar - CitySprint
A geospatial and time series analysis of the CitySprint fleet
2. Blue signals a pick-up
Red signals a drop-off
Sample of what one driver’s journey looks like
Used for:
• Viewing the base unit of analysis
3. Demand heat map
Heat map of pickup location density
Used for:
• Optimising resource allocation
• Identifying areas for potential expansion
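A heat map of this kind boils down to counting pickups per map cell. The following is a minimal sketch in plain Python, not the ArcGIS rendering used in the deck; the function name `density_grid`, the 0.1-degree cell size, and the sample coordinates are all illustrative assumptions.

```python
import math
from collections import Counter

def density_grid(points, cell_deg=0.1):
    """Count pickups per grid cell; keys are (lat_index, lon_index) integers."""
    counts = Counter()
    for lat, lon in points:
        # Floor each coordinate to the index of the cell that contains it.
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts

# Hypothetical pickup coordinates (lat, lon), not taken from the CitySprint data.
pickups = [(51.51, -0.12), (51.52, -0.13), (51.53, -0.11), (53.48, -2.24)]
grid = density_grid(pickups, cell_deg=0.1)
hottest_cell, hottest_count = grid.most_common(1)[0]
```

Cells with the highest counts are the "hot" areas; a mapping tool would then colour each cell by its count.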
4. K-means clustering analysis – 40 centres
Employed the K-means algorithm to identify clusters
of pickup points
Used for:
• Validating against current service centres map
• Identifying areas for potential expansion
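The clustering step can be illustrated with a small K-means (Lloyd's algorithm) sketch. This is plain Python for readability, not the Spark job used in the project, and it assumes latitude/longitude can be treated as flat coordinates, which is only a rough approximation. The function name `kmeans`, the iteration count, the seed, and the sample points are hypothetical.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Return k cluster centres for a list of (lat, lon) tuples."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centres[i][0]) ** 2 + (p[1] - centres[i][1]) ** 2,
            )
            groups[nearest].append(p)
        # Update step: move each centre to the mean of its group
        # (an empty group keeps its previous centre).
        centres = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else centres[i]
            for i, g in enumerate(groups)
        ]
    return centres

# Hypothetical pickups around two towns, not taken from the CitySprint data.
pickups = [
    (51.50, -0.10), (51.52, -0.12), (51.51, -0.08),
    (53.48, -2.24), (53.47, -2.25), (53.49, -2.23),
]
centres = sorted(kmeans(pickups, k=2))
```

The number of centres `k` has to be chosen up front, which is why the deck shows separate runs with 40 and 100 centres.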
5. K-means 100 centres
Higher granularity clustering
Used for:
• Assessing the frequency of pickups for micro-clusters (e.g. villages, neighbourhoods)
• Directing drivers to hotter waiting areas
6. Geographical supply & demand
Pickup locations shown against driver routes
Used for:
• Improving the likelihood of parcel pickup while en route
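One way to relate pickups to routes is to check which pickup locations lie within a short distance of any GPS point on a route. The sketch below uses the haversine great-circle distance; the helper names, the 2 km radius, and the coordinates are assumptions for illustration and do not describe the method actually used in the project.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def pickups_near_route(route, pickups, radius_km=2.0):
    """Return the pickups that lie within radius_km of at least one route point."""
    return [p for p in pickups if any(haversine_km(p, r) <= radius_km for r in route)]

# Hypothetical route GPS points and pickup locations.
route = [(51.50, -0.10), (51.60, -0.20)]
pickups = [(51.505, -0.10), (53.48, -2.24)]
near = pickups_near_route(route, pickups)
```

Pickups that never appear in `near` for any route are candidates for demand that en-route drivers currently pass by.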
7. Demand variation across time
[Chart: expected parcels allocated to cluster 41 (Stevenage). X-axis: time of day (hour, 0 to 23). Y-axis: expected parcels (0.0 to 18.0).]
Used for:
• Positioning couriers in the right place at the right time
For each demand cluster we calculated the
frequency of pickups per hour
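The per-cluster hourly frequency can be sketched as a simple count by cluster and hour. The example below additionally divides by the number of distinct days seen to get an average per day; that normalisation, the function name `hourly_profile`, and the sample events are assumptions, since the deck does not state how "expected parcels" was computed.

```python
from collections import defaultdict
from datetime import datetime

def hourly_profile(events):
    """Map cluster id -> list of 24 values: average pickups per day for each hour."""
    counts = defaultdict(lambda: [0] * 24)
    days = set()
    for cluster_id, timestamp in events:
        when = datetime.fromisoformat(timestamp)
        counts[cluster_id][when.hour] += 1
        days.add(when.date())
    n_days = max(len(days), 1)  # avoid dividing by zero on empty input
    return {c: [n / n_days for n in hours] for c, hours in counts.items()}

# Hypothetical (cluster id, pickup timestamp) pairs spanning two days.
events = [
    (41, "2015-03-02T09:15:00"),
    (41, "2015-03-02T09:40:00"),
    (41, "2015-03-03T09:05:00"),
    (41, "2015-03-03T17:30:00"),
    (7, "2015-03-02T12:00:00"),
]
profile = hourly_profile(events)
```

Here `profile[41]` is the 24-value timetable for cluster 41, the kind of series plotted in the chart for Stevenage.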
8. The solution outline
• Data science capabilities of Spark, easy to use with SQL knowledge
• Map plotting on ArcGIS – heat mapping, zoom in/out capabilities, real-time
• High performance due to in-memory processing capabilities of Spark
• Can work with large data sets due to high-performance disk-based data access
in the Hadoop Distributed File System (HDFS)
• Can import data from EDW
9. Why Bigstep?
• Easy to use - Easy to deploy, redeploy, erase and rewind. Easy to experiment with
• Big Data Focus – Infrastructure, orchestration, and software ecosystem deliver
performance & ease of use for big data
• Domain Experts – Extensive hands-on experience in delivering complex big data
solutions for multiple verticals & use cases
• Consultative Approach – Direct contact and support from experienced big data, devops,
and infrastructure specialists
• Best In Class Infrastructure – The world’s highest performance cloud
Objectives:
Take geospatial and time series data and make it easily manageable and usable by business users
Discover new business insights to optimize operations
Run real-time analysis on 22,626,119 records
Test if Spark and Hadoop are suitable data analysis tools for CitySprint
Design a flexible, versatile environment for analyzing fleet data
Implement a solution with enough performance that real-time data exploration is possible on the full dataset
Follows a random driver on a typical day through pickup and dropoff points.
The map can zoom in and out.
Shows the hot spots of pickup points across the UK. A good overview of the overall dataset.
Compared against our service center locations, the clustering shows a few differences. A clustering algorithm identifies ‘clusters’ of elements on its own; K-means, however, needs to be told how many clusters to look for.
This is what happens if we tell k-means to split the dataset into 100 hot locations.
The blue dots are actual GPS positions of en-route drivers. They show typical routes, but only some routes go through hot areas.
A ‘cluster’ timetable is used to predict demand at a particular cluster at a particular time. Useful for telling a driver whether to stay put or continue to their destination. This can help uberize the business.
Used a combination of technologies, mostly Spark on Hadoop on Bigstep. Imported data from the production Postgres DB via Sqoop into Avro, and from there via Spark into various CSV files rendered by Esri ArcGIS. Postgres consolidates information from mobile devices.