Christian jensen advanced routing in spatial networks using big data

Advanced Routing in Spatial
Networks Using Big Data
Christian S. Jensen
www.cs.aau.dk/~csj

Roadmap
• Big data, motivation, setting, overview
• Modeling and computation of weights
 Spatial network lifting
 EcoMark
 Time-varying, uncertain weights
 Complete weight annotation using incomplete information
 Near-future travel cost inference
• Routing
 Stochastic skyline routing
 Routing based on local-driver behavior
• Closing
 Demos, the future, challenges, acknowledgments, readings

Roadmap
• Big data
 Hype or substance?
 Instrumentation of reality and digitization
 The digital universe
 Moore’s Law generalized
 Big data challenges

Hype or Substance?
• We have been pushing the boundaries for decades
 How much data we can handle
 How fast
 Data integration
• Examples
 VLDB: International Conference on Very Large Data Bases
 TODS: ACM Transactions on Database Systems
• So is it all hype?
 No

Instrumentation and Digitization
• Instrumentation of reality
 Notably, smartphones
• Digitization of processes
 E.g., e-commerce, public services, communications, social
interactions

Big Data
Every day, we create 2.5 quintillion bytes of data — so much
that 90% of the data in the world today has been created in
the last two years alone. This data comes from everywhere:
sensors used to gather climate information, posts to social
media sites, digital pictures and videos, purchase transaction
records, and cell phone GPS signals to name a few. This
data is big data.
http://www-01.ibm.com/software/data/bigdata/

The Digital Universe
• The digital universe
 Doubling every ~18-24 months
 Grew 60% in 2009, 50% in 2010
 2009: 0.8 zettabyte, 2010: 1.2 zettabyte, 2020: 35 zettabytes
 2009-20: growth by a factor of 44
http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm
1 zettabyte = 1024 exabytes
1 exabyte = 1024 petabytes = 2^60 bytes = 1,152,921,504,606,846,976 bytes ≈ 10^18 bytes

Moore’s Law – The Bicycle Analogy
• Moore’s Law: computers double in speed every 24
months.
 Applies also to quality-adjusted microprocessor prices, memory
capacity, disks, networks, sensors, and the number and size of
pixels in cameras.
• How fast would a bicyclist be if Moore’s Law applied?
 50 years of doubling every 24 months
 30 km/h originally
 ~1 billion km/h now
• Three lessons
 Growth rates in computing are dramatic and difficult to imagine.
 Hardware advances are important information technology drivers.
 Humans don’t really improve – they are the constants.
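The bicycle arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the constant names are illustrative and not from the slides.

```python
# Back-of-the-envelope check of the bicycle analogy.
YEARS = 50
DOUBLING_PERIOD_YEARS = 2      # "doubling every 24 months"
INITIAL_SPEED_KMH = 30

doublings = YEARS // DOUBLING_PERIOD_YEARS
speed_now_kmh = INITIAL_SPEED_KMH * 2 ** doublings

print(doublings)       # 25
print(speed_now_kmh)   # 1006632960, i.e., roughly 1 billion km/h
```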

Big Data – Synthesis
• The result is new opportunity.
• Lots of data and unprecedented computing infrastructure
combine to offer potentials for value creation from data.
• To be competitive, society and businesses must be able to
create value from data.
• Data-based decisions and data-driven processes
 Decisions based on good data beat decisions based on feelings or
opinions.
• A finer granularity of services
• Entirely new services

Big Data – Data-Driven Society, Business

Christian S. Jensen
csj@cs.aau.dk
Big Data in Transportation:
Overview

Motivation – ITS
• A safer, greener, and more efficient and cost-effective
transportation infrastructure
• Greenhouse gas emissions reductions via eco-routing
• Congestion, greater Copenhagen region
 ~10 billion DKK/year (2004)
• Suboptimal settings of signalized intersections in Denmark
 ~9.3 billion DKK/year (2012)

Motivation – Eco-Routing
• The transportation sector is the second largest
greenhouse gas (GHG) emitting sector and also causes
substantial pollution.
 By 2025, it’s predicted that 6.2 billion commutes will be made
every day, worldwide.
• The reduction of greenhouse gas (GHG) emissions from
transportation is essential to combat global climate
change.
 EU: reduce GHG emissions by 30% by 2020.
 G8: a 50% GHG reduction by 2050.
 China: a 17% GHG reduction by 2015.
• Eco-routing can reduce vehicular impact by up to 20%.
• General context: Smart City

System of Sensors Model
• The setting may be modeled as a system of (logical)
streams, one per edge.
 Data is emitted from the stream of an edge when a vehicle
traverses the edge
 Spatial
 Spatio-temporally correlated
 Sparse
• Real, unlike early, envisioned smart dust applications!

Methodology
• The same methodology underlies the studies covered.
• Define precisely a problem of (perceived) real-world
interest.
• Develop solutions
 Concepts, data structures, algorithms
• Carry out mathematical analyses
 Correctness, complexity, storage size
• Prototype the solutions and perform empirical studies
 Often, real data is needed
 Offers detailed insight into the design properties of the solutions
 If you cannot measure it, you cannot improve it
• Iterate!

Data Foundation
• A road network model
 E.g., OpenStreetMap
 Germany: 10 million edges (rough estimate)
• Aerial laser scan data
 For lifting, optional
• Tracking data from vehicles
 E.g., GPS data
• Fuel consumption data from vehicles
 Often called CAN bus data

GPS Data – DW Statistics
• Number of data sources (and NDAs): ca. 20
 Daily data from 4 sources; data in batches from 4 sources (2 small,
2 big); data from ca. 10 finished projects
• Daily data
 4 sources
 ~3 million rows
 3500 vehicles
• Number of fact table rows: More than 4 billion
• Number of vehicles: ca. 25,000
• Rows with fuel data: 500 million (estimate; ~140,000 per
day)
• Rows from EVs: 100-200 million

Software and Hardware
• Experimental infrastructure
• Software
 Have complete software stack for handling traffic data
 Map-matching, data cleansing, multiple map support
• Hardware
 Very modern server farm
 Newest machine has 2TB main memory

Basic Eco Routing – CPH to the Train
Good 17:08 8.06 km 0.80 l
Bad 18:06 13.64 km 1.23 l

Roadmap to Achieving Eco-Routing
EcoMark 3D Road Networks
1. Understand how to quantify environmental impact
2. Assign Eco-Weights to a Road Network
3. Novel Routing Algorithms

Understand Vehicular Environmental Impact
• EcoMark and EcoMark 2.0
 An evaluation framework that compares and analyzes 11 state-of-
the-art vehicular environmental impact models using GPS data
and a 3D road network.
 What do the models need to quantify vehicular environmental
impact?
 Speeds (instantaneous or average), and accelerations: from GPS data
 Road slopes: from 3D road network
 Model categorization
 Instantaneous models: appropriate for eco-driving
 Aggregated models: appropriate for eco-routing
 It is possible to assign eco-weights to a large extent using GPS
data, 3D maps, and appropriate models.
 An extended evaluation framework, EcoMark 2.0, is developed on
top of EcoMark using actual fuel consumption data that is obtained
from CAN bus data.

Elevation matters – 3D Road Network
• Efficient methods that can build a 3D road network from
3D laser scan data and a 2D road network.
• Use 3D laser scan data to build a terrain model that is
represented as a Triangulated Irregular Network (TIN).
• Project the 2D polylines onto the TIN to obtain 3D
polylines, which yields a 3D road network.

Simple vs. Advanced
Eco-Weights
Static vs. Dynamic
Eco-Weights

Eco-Weights – Simple vs. Advanced
• Simple: time dependent, deterministic.
 Each interval has a deterministic weight.
 Ex: (0:00, 7:00]: 10 mg; (7:00, 9:00]: 18 mg; (9:00, 15:00]: 12 mg; …
• Advanced: time dependent, uncertain.
 Each interval has an uncertain weight, i.e., a random variable that
follows some distribution (e.g., a uniform/normal distribution)
 Ex: (0:00, 7:00]: 8 to 12 mg, N(9, 10); (7:00, 9:00]: 13 to 23 mg,
N(18, 10); (9:00, 15:00]: 8 to 12 mg, N(10, 20); …
• Simple weights
 Representation: intervals, each with a single value
 Coverage: high (all edges)
 Temporal granularity: coarse (e.g., pre-defined PEAK)
 Accuracy: good (better than speed limits)
• Advanced weights
 Representation: intervals, each with a histogram or a probability density function
 Coverage: low (only "hot" edges)
 Temporal granularity: fine (e.g., 15-min intervals)
 Accuracy: high level of detail (captures the travel cost distributions)

Eco-Weights – Static vs. Dynamic
• Static Weights
 Obtained based on historical GPS data and remain static
 Capture typical traffic patterns
• Dynamic Weights
 Updated as the real-time GPS data streams in
 Capture dynamic and instant traffic patterns
• Our solutions
 Simple, static: Graph Weight Annotation
 Simple, dynamic: Spatio-Temporal Hidden Markov Model
 Advanced, static: Time-Dependent Histograms
 Advanced, dynamic: Spatio-Temporal Hidden Markov Model

Assigning Eco-Weights
• Static, simple weights, all edges
 Challenge: infer weights for "cold" edges, i.e., the edges that are not
covered by any GPS data
 Graph weight annotation
 Quantify edge similarities based on traffic flows that are derived from the
topology of a road network
 Propagate the weights on hot edges to similar cold edges
• Static, advanced weights, hot edges
 Capturing the weights of hot edges using time dependent histograms
 Challenge: compact representation of the histograms
• Dynamic, simple and advanced weights, hot edges
 Inferring near-future weights of hot edges as GPS data streams in
 Challenge: sparseness, dependency, and heterogeneity

Time-Dependent Histograms
• We need to contend with three challenges.
• Multiple travel costs
 Consider not only GHG emissions, but also distance, travel time.
• Time-dependence
 Some travel costs are time-dependent.
 E.g., travel times on the same road may vary during peak vs. off-
peak periods.
• Uncertainty
 Some travel costs are uncertain.
 E.g., aggressive and moderate driving on the same road may
produce different GHG emissions.

Road Network Model: MTUG
• MTUG: Multi-cost, Time-dependent, Uncertain Graph
• N different costs are of interest.
 Distance (DI), travel time (TT), GHG emissions (GE).
• An MTUG is a graph G = (V, E, MM, W)
 V is the vertex set, and E is the edge set.
 MM = <MM(1), …, MM(N)>
 Function MM(i) maps an edge to the minimum and maximum i-th
cost of using the edge.
 MM(TT) (ei) = (150 seconds, 500 seconds)
 MM(GE) (ei) = (10 ml, 85 ml)
 W = <W(1), …, W(N)>
 Function W(i) maps an edge to a set of (interval, random variable)
pairs of the i-th cost type.
 W(TT) (ej) = {([0:00, 7:15), N(300, 120)), ([7:15, 8:45), N(450, 100)), …}
 W(GE) (ej) = {([0:00, 7:00), N(30, 100)), ([7:00, 9:00), N(50, 80)), …}

Instantiation of MM in an MTUG
• GPS records are map matched to edges.
• Each edge is associated with a set of edge records of the
form (e, t, C)
 An edge record indicates that a traversal on edge e at time t takes
costs C, where C is a vector of all costs of interest.
 (e1, 8:08, <55 seconds, 80 ml>)
 (e1, 9:08, < 45 seconds , 63 ml>)
 (e1, 9:10, < 43 seconds , 60 ml>)
 (e1, 10:03, < 45 seconds , 62 ml>)
• MM(TT) (e1) = (43 seconds, 55 seconds)
• MM(GE) (e1) = (60 ml, 80 ml)
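The instantiation of MM can be sketched as a per-edge minimum and maximum over the edge records. The function name `instantiate_mm` and the tuple layout of the cost vector are assumptions made for this illustration; the records are the ones listed above.

```python
from collections import defaultdict

# Edge records (e, t, C) from the slide; C = (travel time in seconds, GHG in ml).
edge_records = [
    ("e1", "8:08", (55, 80)),
    ("e1", "9:08", (45, 63)),
    ("e1", "9:10", (43, 60)),
    ("e1", "10:03", (45, 62)),
]

def instantiate_mm(records, num_costs=2):
    """Return {edge: [(min, max) for each cost type]}."""
    per_edge = defaultdict(list)
    for edge, _t, costs in records:
        per_edge[edge].append(costs)
    return {
        edge: [
            (min(c[i] for c in vectors), max(c[i] for c in vectors))
            for i in range(num_costs)
        ]
        for edge, vectors in per_edge.items()
    }

mm = instantiate_mm(edge_records)
print(mm["e1"])  # [(43, 55), (60, 80)]
```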

Instantiation of W in an MTUG
• Partition a day into 96 15-min intervals.
• For each edge and interval, we obtain a multi-set containing
the costs on the edge during the interval.
 ms = {(10 s, 3), (20 s, 10), (25 s, 20), (30 s, 10), (40 s, 5), …}
• Estimate a random variable (RV) based on the multi-set.
• Combine adjacent intervals with similar RVs and estimate a
new RV. The long intervals along with the RVs instantiate W.
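The first steps of the W instantiation might look as follows. The slides do not say how the random variable is estimated, so the empirical distribution used here is an assumption, as are all function names; the merging of adjacent intervals is omitted.

```python
from collections import Counter, defaultdict

INTERVAL_MINUTES = 15  # 96 intervals per day

def interval_index(hhmm):
    """Map a time string such as '8:07' to its 15-minute interval (0..95)."""
    hours, minutes = map(int, hhmm.split(":"))
    return (hours * 60 + minutes) // INTERVAL_MINUTES

def build_multisets(records):
    """records: iterable of (edge, 'H:MM', cost) -> {(edge, interval): Counter}."""
    multisets = defaultdict(Counter)
    for edge, t, cost in records:
        multisets[(edge, interval_index(t))][cost] += 1
    return multisets

def empirical_rv(multiset):
    """Assumption: estimate the RV as the empirical distribution of the multi-set."""
    total = sum(multiset.values())
    return {cost: count / total for cost, count in sorted(multiset.items())}

# Hypothetical travel times (seconds) observed on edge e1.
records = [("e1", "8:01", 20), ("e1", "8:07", 25), ("e1", "8:14", 25), ("e1", "8:20", 30)]
ms = build_multisets(records)
rv = empirical_rv(ms[("e1", interval_index("8:05"))])
print(rv)
```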

Route Costs in MTUG
• Given a route Ri = <r1, r2, …, rX>, where ri ∈ E is an edge.
• RC(Ri, t) indicates the costs of using route Ri at time t
 RC(Ri, t) = <RVDI, RVTT, RVGE> is a vector of RVs, where each RV
corresponds to a travel cost.
• RVDI is deterministic and equals the sum of the length of
each edge in route Ri.
• RVTT is the convolution of the travel time RV of each edge
in route Ri.
 The travel time RV of the first edge r1 is dependent on the trip start
time t.
 The travel time RV of the k-th edge rk is dependent on the travel
time of the previous k-1 edges, which are generally uncertain.
• RVGE is the convolution of the GHG emissions RV of each
edge in route Ri.
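A minimal sketch of the time-dependent convolution for travel time, using discrete distributions. It assumes that edge travel times are independent given the entry time; `edge_rv` is a hypothetical lookup function, and the example values are invented.

```python
from collections import defaultdict

def route_travel_time(route, start, edge_rv):
    """Distribution of the total travel time along `route` when starting at `start`.

    edge_rv(edge, t) -> {travel_time: probability} is a hypothetical lookup that
    returns the travel time RV of `edge` when it is entered at time t.
    """
    total = {0: 1.0}  # elapsed time so far -> probability
    for edge in route:
        nxt = defaultdict(float)
        for elapsed, p in total.items():
            # The RV of this edge depends on the (uncertain) arrival time.
            for tt, q in edge_rv(edge, start + elapsed).items():
                nxt[elapsed + tt] += p * q
        total = dict(nxt)
    return total

def example_edge_rv(edge, t):
    if edge == "e1":
        return {10: 0.5, 20: 0.5}
    # e2 is slower when entered at time 15 or later.
    return {5: 1.0} if t < 15 else {10: 1.0}

result = route_travel_time(["e1", "e2"], 0, example_edge_rv)
print(result)  # {15: 0.5, 30: 0.5}
```

The GHG emissions RV can be aggregated in the same way, with the arrival time still driven by the travel time distribution.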

Simple vs. Advanced
Eco-Weights
Static vs. Dynamic
Eco-Weights
Skyline
Eco-Routing
Continuous
Eco-Routing
Personalized
Eco-Routing

Novel Routing Algorithms
• Stochastic skyline route planning under time-varying
uncertainty.
 Identify Pareto-optimal routes considering multiple travel costs,
e.g., travel times, GHG emissions, distances.
 R1 (15 min, 500 ml, 25 km)
 R2 (13 min, 520 ml, 26 km)
 R3 (16 min, 530 ml, 30 km)
• Continuous routing that supports real-time updates on
eco-weights (i.e., dynamic eco-weights).
 Re-route vehicles in real time according to recently updated
eco-weights and the current locations of the vehicles.
• Context-aware, personalized routing
 Identify context-aware preferences for each driver (e.g., time-
efficient driving or fuel-efficient driving).
 Choose a single route that best satisfies the driver’s
preferences.

Deterministic Skyline Routes
• Route cost: cost(Ri) = <DI, TT, GE>
 A vector of deterministic values.
 Each value corresponds to a travel cost.
• Dominance relationship
 Ri dominates Rj iff every cost of Ri is no greater than the corresponding cost of Rj,
and at least one cost of Ri is strictly smaller than that of Rj.
• Consider multiple routes for the same source-destination.
 R1: 3.5 km, 230 mg, 10 min;
 R2: 5.1 km, 250 mg, 11 min;
 R3: 5.1 km, 200 mg, 12 min;
• The skyline routes are the non-dominated routes.
 Since R2 is dominated by R1, R1 and R3 are the skyline routes.
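The dominance test and the skyline for the three example routes can be sketched as follows. This is the brute-force pairwise check, not an efficient algorithm; the function names are illustrative.

```python
def dominates(a, b):
    """a dominates b: no cost is worse and at least one is strictly better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(routes):
    """Return the routes that are not dominated by any other route."""
    return {
        name: costs
        for name, costs in routes.items()
        if not any(dominates(other, costs)
                   for other_name, other in routes.items() if other_name != name)
    }

# (distance in km, GHG emissions in mg, travel time in min), as on the slide.
routes = {"R1": (3.5, 230, 10), "R2": (5.1, 250, 11), "R3": (5.1, 200, 12)}
print(sorted(skyline(routes)))  # ['R1', 'R3']
```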

Stochastic Dominance
• Route cost: RC(Ri) = <RVDI, RVTT, RVGE>
 A vector of random variables (RVs), where each RV represents the
distribution of a travel cost.
• Stochastic Dominance between two RVs
 Given two RVs X and Y, if cdfX(a) >= cdfY(a) for all possible values
a in R+, we say "X stochastically dominates Y".
 Cost(R1).RVTT stochastically dominates Cost(R2).RVTT.
 No stochastic dominance between Cost(R3).RVTT and Cost(R4).RVTT.
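For discrete cost distributions, the CDF condition can be checked at the support points of both RVs, since the CDFs are constant in between. The sketch below assumes distributions are given as {value: probability} dictionaries; the travel time values are invented so that they reproduce the two cases named above.

```python
def cdf(dist, a):
    """P(X <= a) for a discrete distribution {value: probability}."""
    return sum(p for value, p in dist.items() if value <= a)

def stochastically_dominates(x, y, eps=1e-12):
    """True if cdf_X(a) >= cdf_Y(a) at every support point of X and Y."""
    points = sorted(set(x) | set(y))
    return all(cdf(x, a) >= cdf(y, a) - eps for a in points)

# Hypothetical travel time distributions (minutes).
tt_r1 = {10: 0.5, 12: 0.5}
tt_r2 = {11: 0.5, 14: 0.5}
tt_r3 = {9: 0.5, 15: 0.5}
tt_r4 = {10: 0.5, 12: 0.5}

print(stochastically_dominates(tt_r1, tt_r2))  # True
print(stochastically_dominates(tt_r3, tt_r4))  # False
print(stochastically_dominates(tt_r4, tt_r3))  # False
```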

Stochastic Skyline Routes
• Dominance between two routes Ri and Rj
 If each RV of cost(Ri) stochastically dominates the corresponding
RV of cost(Rj), then Ri dominates Rj.
• Stochastic Skyline Routes
 Given a source-destination pair, and a trip starting time.
 Stochastic skyline routes are the routes that are not dominated by
any other routes.
• Brute force stochastic skyline route planning
 Enumerate all possible routes, compute the route costs, and check
whether one route dominates another.
 Very inefficient and works only for small road networks.
• Efficient stochastic skyline route planning
 Prune early some routes that cannot be skyline routes.
 Efficient stochastic dominance checking.

Example Result
• Skyline routes R1, R2, and R3, identified by our algorithm
• R1: 94,849 m; R2: 106,216 m; R3: 91,382 m;
 DI: R3 dominates R1 and R2.
 TT: R1 dominates R2 and R3
 GE: R2 dominates R1 and R3.

Personalized Routing
• Different drivers may take different routes
because they may have quite different
preferences.
• The same drivers may take different
routes in different contexts.
 Morning: try to save time to avoid being late.
 Weekend afternoon: try to save fuel
consumption.
• Challenges
 Identify contexts for drivers and identify driving
preference in each context.
 Deal with time-dependent uncertain travel
costs, e.g., travel time and fuel consumption,
while considering individual drivers’ driving
behaviors, e.g., aggressive vs. moderate
driving.

Framework
• Offline phase
 Drivers’ trajectories → context and preference identification → contexts and preferences
• Online phase
 Query (driver, source, destination, time) and the contexts and preferences → routing module → optimal routes

Example Results
 Dark, bold routes: actual routes used by drivers.
 Red routes: shortest routes.
 Green routes: fastest routes.
 Blue routes: predicted routes using the identified contexts and
driving preferences.

Summary
• Eco-routing can effectively reduce the environmental
impact of the transportation sector.
• Eco-routing is achieved by
 Reliably and accurately quantifying vehicles’ environmental impact
using GPS data and 3D road networks;
 Assigning various types of eco-weights to edges;
 Developing novel routing algorithms that enable effective and
practical eco-routing service.

Building Accurate 3D
Spatial Networks

Outline
• Motivation
• Solution
 Phase A: Filtering
 Phase B: Lifting
• Experiments
• Summary

High-Res 3D Model Usage
Eco-Routing
•USD 6 Billion per annum in Fuel
savings with 3D model –
TomTom
•EcoMark: Benchmark shows
better CO2 emissions estimation.

The Idea
2D Road Network Aerial Laser Scan
3D Road Network

Classifying 3D Models
• Need for accurate 3D spatial networks by
applications.
• Ease of availability of rich terrain datasets.
3D Road Model
High-Res Model
•± 20 cm Accuracy
•Higher Storage Cost
Low-Res Model
•± 2 m Accuracy
•Lower Storage Cost

Current 3D Map Generation Methods
• Van fitted with Differential GPS driven on
roads

Current 3D Map Generation Methods
• Crowd-sourcing from "Personal
Navigation Devices" (PNDs).

Solution
GOAL: Design two methods to
generate High and Low resolution
3D maps.

Problem Definition
 Input
 A 2D spatial network, where an edge is modeled as a 2D polyline
 A laser scan point cloud containing a set of 3D points
 Output
 A 3D spatial network, where an edge is modeled as a 3D polyline
[Figure: 2D spatial network + 3D points → 3D spatial network; labels: 3D points, ε-neighborhood]

Framework
• Input: 2D spatial network, laser scan point cloud
• Filtering: one-pass filtering or external-memory-based filtering
• Lifting: Delaunay triangulation, interpolation
• Output: 3D spatial network

One-Pass Filter (OPF)
•3D laser scan points are "inherently sorted spatially"!
•Data Guarantee: A 1m x 1m cell contains two 3-D points.
•Assumption: The road is much smaller and fits in main memory.
 Build a grid-index Hg from the road segments, mapping road points to cell IDs.
 Scan the massive 3D point cloud in a single pass and build a 3D point to cell mapping Hp "seeded" by Hg.
 If the 3D points in a cell exceed a threshold cell occupancy of α × δ × δ, where α is
the user-defined threshold in [0,1] and δ is the grid cell width, lift the cell
contents and release the cell memory!
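The three OPF steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: `density` (expected points per unit area) is a hypothetical parameter used to turn the occupancy threshold α × δ × δ into a point count, and "lifting" a cell is reduced to taking the mean elevation of its points.

```python
from collections import defaultdict

def one_pass_filter(road_points, point_stream, delta=1.0, alpha=0.5, density=2.0):
    """Sketch of the one-pass filter; returns {cell: estimated elevation}."""
    def cell_of(x, y):
        return (int(x // delta), int(y // delta))

    hg = {cell_of(x, y) for x, y in road_points}   # grid index of road cells
    hp = defaultdict(list)                         # cell -> elevations seen so far
    threshold = alpha * delta * delta * density    # assumed point-count threshold
    lifted = {}
    for x, y, z in point_stream:                   # single pass over the point cloud
        c = cell_of(x, y)
        if c not in hg or c in lifted:
            continue                               # not a road cell, or already lifted
        hp[c].append(z)
        if len(hp[c]) >= threshold:
            lifted[c] = sum(hp[c]) / len(hp[c])    # lift: mean elevation (assumption)
            del hp[c]                              # release the cell memory
    return lifted

road = [(0.5, 0.5)]
stream = [(0.2, 0.3, 10.0), (5.0, 5.0, 99.0), (0.8, 0.9, 12.0)]
lifted = one_pass_filter(road, stream, delta=1.0, alpha=1.0, density=2.0)
print(lifted)  # {(0, 0): 11.0}
```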

One-Pass Filter (OPF)
Advantages of OPF
Needs no pre-processing of massive 3D point cloud for sorting or
indexing.
Builds parts of the 3D road network as 3D points are streaming
through.
Faster runtimes, lower resolution. (± 2m accuracy)

External Memory based Filtering (EMF)
• No assumption about road network size.
Neither road points nor 3D points fit in
memory. Hence, stored on disk.
• Limited memory budget available to the
method.
 Moore Neighborhood: 3,5,8-cell
neighborhood. Ex: grey shaded
region around cell (1,1).

Locality Preserving Space-Filling Curves

 Read road point disk blocks Br from disk into main memory.
 For each block Br, read the 3D point blocks in its Moore neighborhood from
disk into a limited LRU cache (size given in multiples of the grid width c).
 Continue this process in a space-filling curve (Z-order or Row-Major)
fashion to avoid re-reading 3D point blocks from disk.
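The effect of the processing order on disk reads can be explored with a small simulation. The sketch below is an illustration only: it models blocks as grid cells, uses a bit-interleaving Z-order key, and counts how many blocks must be fetched through an LRU cache; all names and the 4 × 4 grid are assumptions.

```python
from collections import OrderedDict

def morton(row, col, bits=16):
    """Z-order key: interleave the bits of row and col."""
    key = 0
    for i in range(bits):
        key |= ((col >> i) & 1) << (2 * i)
        key |= ((row >> i) & 1) << (2 * i + 1)
    return key

def count_block_reads(n, order, cache_size):
    """Process the blocks of an n x n grid in `order`, touching each block's Moore
    neighborhood through an LRU cache; return the number of disk reads."""
    cache, reads = OrderedDict(), 0
    for r, c in order:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                b = (r + dr, c + dc)
                if not (0 <= b[0] < n and 0 <= b[1] < n):
                    continue
                if b in cache:
                    cache.move_to_end(b)           # mark as most recently used
                else:
                    reads += 1                     # cache miss: read from disk
                    cache[b] = True
                    if len(cache) > cache_size:
                        cache.popitem(last=False)  # evict least recently used
    return reads

n = 4
row_major = [(r, c) for r in range(n) for c in range(n)]
z_order = sorted(row_major, key=lambda b: morton(*b))

reads_row_major = count_block_reads(n, row_major, cache_size=3 * n)
reads_z_order = count_block_reads(n, z_order, cache_size=3 * n)
print(reads_row_major, reads_z_order)
```

With a cache of three grid rows, the row-major run in this toy setup reads each of the 16 blocks exactly once.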

Z-Order vs. Row-Major Disk Access

 Why is Row-Major doing well at 3c?
[Figure: LRU cache states while processing blocks; legend: block to process, neighbor block needed in memory, blocks in cache, blocks evicted]
 Answer: Disk blocks are read into memory exactly once!

Advantages of EMF
Requires little main memory
Slower runtimes, higher resolution (± 20 cm accuracy)
Can work to "zoom in" on OPF-created maps to improve
resolution on mobile devices

Triangulation
• Delaunay Triangulation
 After filtering, a set of 3D points is associated with a cell of road
segments.
 The set of 3D points is triangulated.
[Figure: 3D points and 2D road points after filtering, OPF vs. EMF]

Interpolation
 Project the TIN triangles
to the x-y plane.
 Compute the
intersections of triangles
and road 2D polyline.
 Re-project road points +
intersections back to TIN
surface again.
 Interpolate road points
on TIN to get elevations.
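The last step, interpolating an elevation for a road point that lies inside a TIN triangle, can be done with barycentric weights computed in the x-y plane. This is a standard technique offered as a sketch; the slides do not state which interpolation the authors use.

```python
def interpolate_elevation(p, tri):
    """Elevation at 2D point p inside triangle tri = ((x, y, z), (x, y, z), (x, y, z)),
    using barycentric weights computed in the x-y plane."""
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = tri
    x, y = p
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2
    return w1 * z1 + w2 * z2 + w3 * z3

# Hypothetical triangle lying on the plane z = 10 + x + 2y.
tri = ((0, 0, 10), (4, 0, 14), (0, 4, 18))
print(interpolate_elevation((1, 1), tri))  # 13.0
```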

Experimental Study
Dataset Statistics
•Aalborg (AA) : 7km x 5km
•North Jutland (NJ) : 185km × 130km
•NJ approx. 120 times larger than AA.

Experimental Study (Accuracy)
OPF vs. EMF

Experimental Study (Scalability)
• North Jutland (NJ) covering a region of 185km × 130km
• EMF
 Accuracy ±20 cm in 21 hrs with memory use of 230 MB
• OPF
 Accuracy ±2 m in 2 hrs, which is an order of magnitude
faster than EMF, using 2.4 GB of main memory

Summary
• Augment a 2D road network with elevation values from 3D
point cloud
• Propose two methods that work with limited memory budgets
• Extensive experiments to offer insight into both methods
** 3D Spatial Network available from
UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/)

EcoMark:
Evaluating Models of Vehicular
Environmental Impact

Outline
• Motivation
• EcoMark overview
• Model analysis
• Experiments
• Context
• Summary

Motivation
• The reduction of greenhouse gas (GHG) emissions from
transportation is essential to combat global climate
change.
• Eco-driving and eco-routing can reduce GHG emissions.
 Eco-driving: driving behavior
 Eco-routing: routes that minimize fuel usage
• We need to be able to measure vehicular environmental
impact.
 Several impact models have been developed, but a
comprehensive comparison of these is missing.
 The utility of the models for eco-driving and eco-routing is not well
understood.

EcoMark
• EcoMark aims to enable a better understanding of how to
compute the environmental impact of vehicles.
• EcoMark is a benchmark that compares and analyzes 11
state-of-the-art vehicular environmental impact models.
 No such comprehensive benchmark existed previously.
• Key characteristics of EcoMark
 Uses GPS data from vehicles and a 3D spatial network
 Categorizes the 11 models into two major categories
 Compares and analyzes the models in the same categories, and
the models between the two categories

EcoMark Framework
GPS observations 2D Spatial Network Laser Scan Points
Input
Instantaneous Impact
6 Instantaneous Models 5 Aggregated Models
Aggregated Impact
Comparison and Analysis
Insights
EcoMark
Map Matching
GPS Trajectories
Spatial Network Lifting
3D Spatial Network

3D Spatial Network
• Spatial network lifting
 2D spatial network: OpenStreetMap
 Aerial laser scan of Denmark (1+ point per m2; 2.5 TB for
Denmark)
• A 3D spatial network is a directed, weighted graph
denoted as G = (V, E, F, H), where
 V and E are the vertex and edge sets.
 F maps vertices to 3D coordinates.
 H maps edges to grades.
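Given F, one way to derive the grade of an edge for H is rise over horizontal run. This is the usual definition of road grade and is assumed here; the slides do not spell out how H is computed.

```python
import math

def grade(f, u, v):
    """Grade of edge (u, v): elevation change divided by horizontal distance.
    f maps vertices to 3D coordinates (x, y, z) in the same length unit."""
    (x1, y1, z1), (x2, y2, z2) = f[u], f[v]
    run = math.hypot(x2 - x1, y2 - y1)
    return (z2 - z1) / run

# Hypothetical vertices: 50 m apart horizontally, 5 m apart vertically.
f = {"a": (0.0, 0.0, 10.0), "b": (30.0, 40.0, 15.0)}
print(grade(f, "a", "b"))  # 0.1 (uphill)
print(grade(f, "b", "a"))  # -0.1 (downhill)
```

Because the graph is directed, the two directions of a road segment get grades of opposite sign.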

Benchmark Experiments
• Exp1: Eco driving: Comparisons of instantaneous models
 Compare the behaviors of the 6 models.
 Identify models that behave similarly.
• Exp2: Eco routing: Comparisons of aggregated models
 Compare the behaviors of the 5 models.
 Identify models that behave similarly.
• Exp3: Cross-comparisons of instantaneous and
aggregated models
 Identify the similarity between aggregated instantaneous models
and aggregated models.
• Exp4: Study of models that support grade information
 How important is 3D – effects of using a 3D spatial network.

Insights
• Instantaneous models
 Five of six models behaved consistently.
 They can be used to measure the impact of driving behaviors.
 They can be used to classify good and bad driving behaviors.
• Aggregated models
 All the five models behaved consistently.
 Not suitable to identify impact at operational levels.
 They can be used to assign eco-weights to road segments, and
are helpful for eco-routing.
• Importance of 3D Spatial Networks
 Road grades affect all models substantially.
 Eco-driving and eco-routing both benefit.

Summary
• EcoMark
 Evaluates empirically 11 existing models of vehicular
environmental impact using a large trajectory database and a
lifted, 3D OpenStreetMap
 Findings: Instantaneous models support eco driving, while
aggregated models support eco-routing.
• Next steps
 Accuracy comparison with ground truth CAN bus data (EcoMark
2.0)
 Requires access to corresponding GPS and CAN bus data.
 Eco driving and eco routing

Time-Dependent Uncertain Eco-
Weights for Road Networks

Outline
• Motivation
• Problem definition
• Framework
• ERN building
• GHG emissions estimation
• Empirical study
• Summary

Motivation – Advanced Eco-Weights
• Eco-routing is an effective way to reduce vehicular
environmental impact.
• The capture of the environmental costs of traversing
road network edges is key to eco-routing.
 Eco-weights are uncertain.
 Eco-weights are time-dependent.
 Once obtained, existing routing algorithms can be
applied to compute eco-routes.

Problem Definition
• Goal: Obtain an Eco Road Network G = (V, E, F)
 V: Vertex set. Each vertex indicates a road intersection.
 E: Edge set. Each edge indicates a road segment.
 Function F assigns a time-dependent, uncertain eco-weight to
each edge in E.
• Input
 A set of map-matched trajectories TR.
 An accompanying road network G' = (V, E, Null).
• Output
 The Eco Road Network G = (V, E, F).

Approach
• Time-dependent uncertain histograms.
 A vector of (period, histogram) tuples <Ti, Hi>.
 Hi is the histogram describing the distribution of cost values
observed in period Ti.
• Time-dependent uncertain histograms are used to
represent the eco-weight of each road network edge.
• Several data compression techniques are applied to
reduce the storage space while retaining accuracy.
[Figure: histograms with probabilities between 0 and 1 for the periods [8 a.m., 9 a.m.), [9 a.m., 10 a.m.), and [10 a.m., 11 a.m.)]

Framework
• Input: trajectories TR, road network G = (V, E, NULL)
• Steps: traversal record analysis → initial histogram construction → histogram merging → bucket reduction
• Output: Eco Road Network G = (V, E, F)

Traversal Record Analysis
• GPS records are map-matched to the corresponding
edges.
• Map-matched records are transformed into traversal
records.
 A traversal record r = (e, t, tt, ge) indicates that edge e is traversed
by a trajectory trj starting at time t and has travel time tt and GHG
emissions ge.
 The VT-micro environmental impact model is used to estimate the
GHG emissions of each traversal record.

Initial Histogram Building
• Each edge is associated with a set of traversal records
• Divide the time space into intervals with equal width
 The default value is 1 hour, (24 intervals in total).
• For each edge e
 Build equi-width histograms for each time interval.
 The number of buckets per time interval is configurable.
 The histograms are isomorphic.
[Figure: equi-width histograms for [8 a.m., 9 a.m.) and [9 a.m., 10 a.m.); x-axis: GHG emissions (mL) with buckets [0, 20), [20, 40), [40, 60), [60, 80]; y-axis: probability]
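Initial histogram building can be sketched as follows. The value range, bucket count, and record layout are illustrative defaults, not settings from the study; sharing the bucket boundaries across intervals is what makes the histograms isomorphic.

```python
from collections import defaultdict

def build_histograms(records, lo=0.0, hi=80.0, num_buckets=4, interval_hours=1):
    """records: (edge, hour of day as float, GHG emissions)
    -> {(edge, interval): [probability per bucket]}."""
    width = (hi - lo) / num_buckets
    counts = defaultdict(lambda: [0] * num_buckets)
    for edge, hour, ge in records:
        b = min(int((ge - lo) // width), num_buckets - 1)  # last bucket is closed
        counts[(edge, int(hour // interval_hours))][b] += 1
    return {key: [c / sum(v) for c in v] for key, v in counts.items()}

# Hypothetical traversal records for edge e1.
records = [("e1", 8.2, 10), ("e1", 8.5, 30), ("e1", 8.9, 35),
           ("e1", 8.1, 80), ("e1", 9.3, 50)]
hists = build_histograms(records)
print(hists[("e1", 8)])  # [0.25, 0.5, 0.0, 0.25]
print(hists[("e1", 9)])  # [0.0, 0.0, 1.0, 0.0]
```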

Histogram Merging
• For each edge, merge two temporally adjacent histograms
if they are sufficiently similar.
• Use cosine similarity to quantify similarity.
• We use a merge threshold Tmerge to decide when to stop
merging.
$sim(H_i, H_j) = \frac{V(H_i) \cdot V(H_j)}{\lVert V(H_i) \rVert \, \lVert V(H_j) \rVert}$
[Figure: the histograms for [8 a.m., 9 a.m.) and [9 a.m., 10 a.m.) are merged into one histogram for [8 a.m., 10 a.m.); x-axis: GHG emissions (mL) with buckets [0,20), [20,40), [40,60), [60,80]; y-axis: probability]
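A sketch of the merging step. It keeps raw bucket counts so that merged histograms are simply the sum of their parts, and it merges greedily from left to right; both choices, and the example counts, are assumptions made for illustration.

```python
import math

def cosine_similarity(h1, h2):
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0

def merge_adjacent(histograms, t_merge=0.95):
    """histograms: list of ((start, end), counts per bucket), sorted by time.
    Merge a histogram into its predecessor when their similarity >= t_merge."""
    merged = [histograms[0]]
    for (start, end), counts in histograms[1:]:
        (m_start, _m_end), m_counts = merged[-1]
        if cosine_similarity(m_counts, counts) >= t_merge:
            merged[-1] = ((m_start, end), [a + b for a, b in zip(m_counts, counts)])
        else:
            merged.append(((start, end), counts))
    return merged

hists = [((8, 9), [10, 30, 50, 10]),
         ((9, 10), [12, 28, 52, 8]),
         ((10, 11), [60, 30, 5, 5])]
merged = merge_adjacent(hists)
print(merged)
```

In this example the first two hours are merged into one period, while the third stays separate.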

Histogram Bucket Reduction
• Further reduce the storage size of an individual histogram
by merging adjacent buckets.
 Use SSE to measure the merge cost (accuracy loss).
 Merge buckets when the cost does not exceed threshold Tred.
 Iteratively merge adjacent buckets in all the histograms of a road
segment.
[Figure: bucket reduction for the [8 a.m., 10 a.m.) histogram; buckets [0,20) and [20,40) are merged into [0,40), leaving [0,40), [40,60), [60,80]; x-axis: GHG emissions (mL); y-axis: probability]
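Bucket reduction can be sketched as a greedy loop that merges the cheapest adjacent pair while the cost stays within the threshold. The exact SSE definition is not given on the slide; here it is assumed to compare each bucket's density with the density of the merged bucket. The probabilities are invented.

```python
def reduce_buckets(buckets, t_red=0.0005):
    """buckets: list of (lo, hi, p). Repeatedly merge the adjacent pair with the
    smallest SSE while that SSE does not exceed t_red."""
    buckets = list(buckets)
    while len(buckets) > 1:
        best_cost, best_i = None, None
        for i in range(len(buckets) - 1):
            (lo1, hi1, p1), (lo2, hi2, p2) = buckets[i], buckets[i + 1]
            d1, d2 = p1 / (hi1 - lo1), p2 / (hi2 - lo2)   # bucket densities
            dm = (p1 + p2) / (hi2 - lo1)                  # merged density
            cost = (d1 - dm) ** 2 * (hi1 - lo1) + (d2 - dm) ** 2 * (hi2 - lo2)
            if best_cost is None or cost < best_cost:
                best_cost, best_i = cost, i
        if best_cost > t_red:
            break
        (lo1, _, p1), (_, hi2, p2) = buckets[best_i], buckets[best_i + 1]
        buckets[best_i:best_i + 2] = [(lo1, hi2, p1 + p2)]
    return buckets

hist = [(0, 20, 0.1), (20, 40, 0.1), (40, 60, 0.5), (60, 80, 0.3)]
reduced = reduce_buckets(hist)
print(reduced)  # [(0, 40, 0.2), (40, 60, 0.5), (60, 80, 0.3)]
```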

GHG Emissions Estimation
• For a route
 Estimate the distribution of GHG emissions as a histogram.
 Aggregate the histograms of the edges in the route.
• Given two histograms H1 and H2 for adjacent edges
 A histogram H′ is computed that represents the aggregated GHG
emissions distribution for traversing both edges.
[Figure: histograms for edges e1 and e2 with buckets [0,20), [20,40), [40,60), and the aggregated histogram H′ for e1 + e2 with buckets [0,40), [40,80), [80,120); x-axis: GHG emissions (mL); y-axis: probability]

Histogram Aggregation
• Given buckets bj in H1 and bk in H2
• A bucket bi in H′ is generated as
$H'[i].p = H_1[j].p \times H_2[k].p$
$Range(H'[i]) = Range(H_1[j]) + Range(H_2[k])$
• Enumerate all bucket pairs of H1[j] and H2[k].
• Rearrange the buckets in H′ to obtain an equi-width
histogram.
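The aggregation can be sketched as an enumeration of bucket pairs followed by a redistribution into equi-width buckets. Two assumptions are made that the slide does not state: the costs of the two edges are independent, and probability mass is spread uniformly within each pair bucket when rearranging.

```python
def aggregate(h1, h2, width):
    """h1, h2: lists of (lo, hi, p). Returns equi-width buckets (lo, hi, p)."""
    # One bucket per pair: ranges add up, probabilities multiply.
    pairs = [(lo1 + lo2, hi1 + hi2, p1 * p2)
             for lo1, hi1, p1 in h1 for lo2, hi2, p2 in h2]
    start = min(lo for lo, _, _ in pairs)
    end = max(hi for _, hi, _ in pairs)
    n = int(round((end - start) / width))
    out = [0.0] * n
    # Rearrange: split each pair bucket over the equi-width buckets it overlaps.
    for lo, hi, p in pairs:
        for i in range(n):
            b_lo, b_hi = start + i * width, start + (i + 1) * width
            overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
            out[i] += p * overlap / (hi - lo)
    return [(start + i * width, start + (i + 1) * width, out[i]) for i in range(n)]

h = [(0, 20, 0.5), (20, 40, 0.5)]   # hypothetical histogram used for both edges
agg = aggregate(h, h, width=40)
print(agg)  # [(0, 40, 0.5), (40, 80, 0.5)]
```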

Experiments
• Experimental settings
 Data Set: 200 million GPS records.
 Road network of Denmark: 414K vertices and 818K edges.
 GHG emissions: VT-micro is applied to evaluate vehicular
environmental impact.
• Evaluation
 Baseline: without any compression.
 State-of-the-art: T-drive.
 Accuracy, storage, and efficiency.

Storage Study
• Histogram Merging
$MCR_m = \frac{M_{init} - M_{merge}}{M_{init}}$
[Figure: MCR (0.6 to 1) vs. merge threshold (0.9 to 0.98)]
• Bucket Reduction
$MCR_r = \frac{M_{merge} - M_{redu}}{M_{merge}}$
[Figure: MCR (0.6 to 1) vs. bucket limit (35 to 60), for Tmerge = 0.98 and Tmerge = 0.95]

Accuracy Study
• Histogram Aggregation
[Figure: histogram similarity (0.8 to 1) vs. number of edges (2 to 18)]
• MCR vs. Err
[Figure: MCR (0.2 to 0.8) vs. Err (0.1 to 0.18), for ERN and T-drive]
• Error Metric – Err
 The relative accumulated deviation of all the buckets in a
histogram from the original distribution.
• Histogram Merging
[Figure: Err (0 to 0.2) vs. merge threshold (0.9 to 0.98); series: after merging, equi-width]
• Bucket Reduction
[Figure: Err (0 to 0.4) vs. bucket limit (35 to 60), for Tmerge = 0.98 and Tmerge = 0.95]

Summary
• We propose a technique to estimate the time-dependent
environmental travel costs in a road network based on
histograms.
• An empirical study suggests that we can efficiently
compute eco-weights of road segments while retaining
accuracy.
• The proposed eco-weights can be used for eco routing
and real-time traffic monitoring.

Using Incomplete Information for
Complete Weight Annotation
of Road Networks

Overview
• Motivation
• Problem definition
• Cost modeling
• Objective functions
 Residual Sum of Squares
 Topology constraint
 Adjacency constraint
• Experiments

Motivation
• Why weight annotation?
 Vehicle routing relies on a weighted-graph representation of the
underlying road network.
 Given a graph with appropriate weights, different types of routes
can be computed.
• Why incomplete information?
 Weights that capture travel costs such as travel times can be
extracted from GPS trajectory data.
 GPS trajectory data typically does not cover all edges.
• Why complete weight annotation?
 To use a graph for routing, all edges must have weights.

Problem Definition
GPS Records
Pre-Processing Module
A set of (trip, cost) pairs
{(t(i), c(i))}
G''(V, E, null)
Weight Annotation Module
G(V, E, H)

Temporal Road Network Cost Modeling
 A road network is modeled as a directed graph.
 Temporal tags: during the same temporal tag, the traffic has similar
behavior, e.g., PEAK and OFFPEAK.
 Each edge has 2 cost values, which indicate the cost per unit
length for traversing the corresponding road segment during PEAK
and OFFPEAK, respectively.
 d(AB, PEAK) indicates the cost per unit length for traversing the road
segment AB during PEAK tag.
 d: a vector of size |E| * |TAGS| (e.g., 6 * 2 =12)
[Figure: example road network with vertices A, B, C, D, E and directed edges AB, BA, BC, CB, BD, CE; each edge e carries two cost variables, d(e, PEAK) and d(e, OFFPEAK)]

Trip Cost Modeling
• GPS trajectory
 A sequence of records of the form (x,y,t).
• Trip
 A sequence of records of the form (edge, ts, te).
• The cost of a trip is the sum of the cost of each record.
 E.g., t1= <AB, 7:55, 8:00>, <BC, 8:00, 8:10>
 cost_model(t1) = d(AB, PEAK) * length(AB) + d(BC, OFFPEAK) *
length(BC)
• If we know all cost variables in d, we can compute the
estimated cost of any trip.
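The trip cost model for the example trip t1 can be written out as follows. The cost values, edge lengths, and the PEAK period [7:00, 8:00) are hypothetical; the period is chosen so that the record starting at 8:00 falls into OFFPEAK, as in the example above.

```python
from datetime import time

# Hypothetical cost per unit length d(edge, tag) and edge lengths.
d = {("AB", "PEAK"): 2.0, ("AB", "OFFPEAK"): 1.0,
     ("BC", "PEAK"): 3.0, ("BC", "OFFPEAK"): 1.5}
length = {"AB": 10.0, "BC": 20.0}

def tag(ts):
    """Assumption: PEAK covers [7:00, 8:00); everything else is OFFPEAK."""
    return "PEAK" if time(7, 0) <= ts < time(8, 0) else "OFFPEAK"

def cost_model(trip):
    """Sum d(edge, tag) * length(edge) over the trip's (edge, ts, te) records."""
    return sum(d[(edge, tag(ts))] * length[edge] for edge, ts, _te in trip)

t1 = [("AB", time(7, 55), time(8, 0)), ("BC", time(8, 0), time(8, 10))]
print(cost_model(t1))  # 2.0 * 10.0 + 1.5 * 20.0 = 50.0
```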

Objective Functions: Residual Sum of Squares
• For every trip t in T’
 The square of the difference between the actual cost of t and the
estimated cost using the proposed model, i.e., cost_model(t).
 E.g., t1= <AB, 7:55, 8:00>, <BC, 8:00, 8:10>, actual_cost(t1) = 100.
 RSS(t1) = (100 – cost_model(t1))^2
 Problem: Training data T’ may not cover the whole road network.
By minimizing the RSS function, we can only get cost variables for
the road segments traversed by T’.
[Figure: the example road network with its cost variables]

Objective Function: Topological Constraint
• If two road segments are in topologically similar positions in a road
network, they tend to have similar costs during the same
tag.
• Given a pair of road segments ei and ej
 The weighted square loss between the cost variables of ei and ej:
 $\mathrm{weight\_TC}(e_i, e_j) \cdot (d(e_i, \mathrm{PEAK}) - d(e_j, \mathrm{PEAK}))^2$
 If weight_TC(ei, ej) is large, d(ei, tag) and d(ej, tag) tend to be
similar in order to make $(d(e_i, \mathrm{PEAK}) - d(e_j, \mathrm{PEAK}))^2$
small.
[Figure: example road network with vertices A, B, C, D, and E and the cost variables d(edge, PEAK) and d(edge, OFFPEAK) of each directed edge.]

Weighted PageRank Computation
• PageRank gives a value to each vertex in a graph.
• Primal graph -> dual graph
 Vertex -> Edge
 Edge -> Vertex
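[Figure: the primal road network with vertices A, B, C, D, and E, and its dual graph, whose vertices are the edges AB, BA, BC, CB, BD, and CE.]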

Weighted PageRank Computation
• Original PageRank
 A vertex evenly gives its PR value to all its adjacent vertices.
 Vertex AB gives ½ * PR(AB) to BC and BD, respectively.
• Weighted PageRank
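 $\mathrm{weight}(AB, BD) = \frac{\mathit{TripNum}(AB, BD) + 1}{\mathit{TripNum}(AB, BD) + \mathit{TripNum}(AB, BC) + |\mathit{degree}(AB)|}$
 100 trips from AB to BC, 50 trips from AB to BD
 $\mathrm{weight}(AB, BD) = \frac{50 + 1}{50 + 100 + 2} = \frac{51}{152}$
 $\mathrm{weight}(AB, BC) = \frac{100 + 1}{50 + 100 + 2} = \frac{101}{152}$
 No trip information
 $\mathrm{weight}(AB, BD) = \mathrm{weight}(AB, BC) = \frac{1}{2}$
[Figure: dual graph with vertices AB, BA, BC, CB, BD, and CE.]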
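The smoothed transition weights can be reproduced with exact fractions. The function name and the dictionary layout below are illustrative choices, not taken from the slides; only the formula and the numbers are.

```python
from fractions import Fraction

def transition_weights(trip_num, successors):
    """Share of its PageRank value that a dual-graph vertex passes to each successor.

    trip_num maps a successor to the number of trips continuing onto it;
    successors without an entry count as zero trips. Adding 1 per successor
    (add-one smoothing) keeps every weight positive.
    """
    total = sum(trip_num.get(s, 0) for s in successors) + len(successors)
    return {s: Fraction(trip_num.get(s, 0) + 1, total) for s in successors}

# Slide example: 100 trips continue from AB to BC, 50 from AB to BD.
w = transition_weights({"BC": 100, "BD": 50}, ["BC", "BD"])

# Without any trip information the weights fall back to the original PageRank split.
no_data = transition_weights({}, ["BC", "BD"])
```

Here `len(successors)` plays the role of |degree(AB)| in the formula, which is 2 for vertex AB in the example.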

Weighted PageRank based Topological Constraint
• Use weighted PageRank value to quantify the topological
similarity.
 If two edges have similar weighted PageRank values, they also
tend to have similar cost.
 $\mathrm{weight\_TC}(e_i, e_j) = \frac{\min(\mathit{PageRank}(e_i), \mathit{PageRank}(e_j))}{\max(\mathit{PageRank}(e_i), \mathit{PageRank}(e_j))}$

Objective Function: Adjacency Constraint
 If two road segments are directionally adjacent in a road network,
they tend to have similar costs during the same tag.
 d(AB, PEAK) may be similar to d(BC, PEAK)
 But it is not necessarily similar to d(CB, PEAK) or d(BC, OFFPEAK)
 Given a pair of directionally adjacent road segments ei and ej
 The weighted square loss between the cost variables of ei and ej:
 $\mathrm{weight}(e_i, e_j) \cdot (d(e_i, \mathrm{PEAK}) - d(e_j, \mathrm{PEAK}))^2$
[Figure: example road network with vertices A, B, C, D, and E and the cost variables d(edge, PEAK) and d(edge, OFFPEAK) of each directed edge.]

Objective Functions
• Residual Sum of Squares
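 $\mathrm{RSS}(d) = \sum_{t \in T'} (\mathrm{actual\_cost}(t) - \mathrm{cost\_model}(t))^2$
• Topological Constraint
 $\mathrm{PRTC}(d) = \sum_{i,j} \mathrm{weight\_TC}(e_i, e_j) \cdot (d(e_i, tag) - d(e_j, tag))^2$
• Adjacency Constraint
 $\mathrm{ACTC}(d) = \sum_{i,j} \mathrm{weight}(e_i, e_j) \cdot (d(e_i, tag) - d(e_j, tag))^2$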
• Overall objective function
 The sum of the above three (plus a regularizer term)
• Differentiating the objective function w.r.t. vector d and
setting it to 0 yields an equation that can be solved using
standard techniques to obtain a solution for d.
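A minimal sketch of that last step, assuming a plain L2 regularizer with weight `gamma` (the slides do not specify the regularizer). Setting the gradient to zero gives the linear system (AᵀA + L + γI) d = Aᵀy, where each row of A holds the edge lengths traversed by one trip, y holds the actual trip costs, and L collects the pairwise constraint weights. The function names, the small Gaussian-elimination helper, and all numbers are illustrative.

```python
def solve_linear(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def annotate(trips, costs, pairs, n, gamma=1e-3):
    """Minimize RSS + pairwise constraints + gamma * |d|^2 via the normal equations.

    trips: one dict per trip mapping a cost-variable index to the traversed length.
    costs: actual cost of each trip.
    pairs: (i, j, w) triples from the topological and adjacency constraints.
    """
    M = [[gamma if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for row, y in zip(trips, costs):
        for i, li in row.items():
            b[i] += li * y
            for j, lj in row.items():
                M[i][j] += li * lj
    for i, j, w in pairs:
        M[i][i] += w
        M[j][j] += w
        M[i][j] -= w
        M[j][i] -= w
    return solve_linear(M, b)

# Variables 0 and 1 are covered by trips; variable 2 is only tied to variable 1.
d = annotate(trips=[{0: 100.0}, {1: 200.0}], costs=[50.0, 40.0],
             pairs=[(1, 2, 1.0)], n=3)
```

Variable 2 has no trip data, yet it receives a value close to that of variable 1 through the pairwise term, which is the point of adding the constraints to RSS.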

Experimental Results
• Two weeks of GPS trajectories from SPF
• 18K trips in total: 20% for training, 80% for testing
• Two road networks: Skagen and North Jutland
• Objective functions used
 F1: RSS; F2: RSS + PRTC; F3: RSS + ACTC
 F4: RSS + PRTC + ACTC

Travel Cost Inference from Sparse,
Spatio-Temporally Correlated Time
Series Using Markov Models

Outline
• Motivation
• Travel-cost time series
• Modeling travel-cost time series in a road network
• Learning a spatio-temporal hidden Markov model
• Empirical study
• Summary

Motivation
• Vehicle routing plays an important role in transportation.
 E.g., fastest routes, eco-routes.
• Effective vehicle routing relies on a weighted graph
representation of a road network.
 The weight of an edge should accurately capture the travel cost of
traversing the edge.
 E.g., travel times, greenhouse gas (GHG) emissions.
• Edge weights should change dynamically to reflect the
dynamics of traffic.
• To support dynamic edge weights, we propose
techniques that are able to infer near-future travel costs
based on incoming GPS data.

Travel-Cost Time Series
• The travel costs on edge ei are modeled as a
time series $TS_i = \langle C_i^{(1)}, C_i^{(2)}, \ldots, C_i^{(T)} \rangle$.
• Travel cost set $C_i^{(j)}$ contains travel costs
observed in the j-th time interval on edge ei.
 An interval is the finest time interval of interest.
 E.g., 15 minutes, yielding 96 intervals per day.
 A travel cost is the travel time or GHG emissions of a
traversal of edge ei during the j-th interval.
 E.g., during the j-th interval [8:00, 8:15), if 3 vehicles
traversed ei taking 30, 32, 40 s, $C_i^{(j)} = \{30, 32, 40\}$.
[Figure: three edges e1, e2, and e3 with their time series TS1, TS2, and TS3.]
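Building such a time series amounts to bucketing observations by edge and interval. The sketch below covers a single day and assumes observations arrive as (edge, hour, minute, cost) tuples; that record layout and the function names are hypothetical.

```python
from collections import defaultdict

INTERVAL_MINUTES = 15  # the slides' example granularity: 96 intervals per day

def interval_index(hour, minute):
    """Index of the interval (0-based) that contains the given time of day."""
    return (hour * 60 + minute) // INTERVAL_MINUTES

def build_time_series(observations):
    """Group travel costs into per-edge, per-interval cost sets."""
    series = defaultdict(lambda: defaultdict(list))
    for edge, hour, minute, cost in observations:
        series[edge][interval_index(hour, minute)].append(cost)
    return series

obs = [("e1", 8, 2, 30), ("e1", 8, 9, 32), ("e1", 8, 14, 40), ("e1", 8, 15, 35)]
ts = build_time_series(obs)
```

Intervals without observations simply have no entry, which is how the sparsity discussed on the next slide shows up in the data.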

Time Series Characteristics
• Dependent
 Travel costs on adjacent edges may be dependent.
• Heterogeneous
 The traffic on different edges is often different.
 Some edges have clear peak/off-peak periods while others do not.
• Sparse
 Some intervals do not have any travel cost observations.

Modeling Travel-Cost Time Series
• We use a Spatio-Temporal Hidden Markov Model
(STHMM) to model the interactions among multiple travel-
cost time series in a road network.
[Figure: state transition on edge e1. Over intervals t-1, t, and t+1 the hidden states are cong, cong, and clr; consecutive states are linked by transition probabilities (TP), and each state emits the observed travel costs of time series TS1 ($C_1^{t-1}$, $C_1^{t}$, $C_1^{t+1}$) through output probabilities (OP).]
• An edge ei has a set of hidden states Si, where each
state indicates a particular traffic status on the edge.
 S1 = {cong, clr}
• During an interval, the traffic of an edge is in a state.
 Intervals: t-1 (previous), t (current), t+1 (next)
• The traffic on an edge evolves over time, transitioning
from one state to another according to transition
probabilities (TP).
• In an interval, the travel costs observed on an edge
depend on the state of the edge according to
output probabilities (OP).

Modeling Multiple Time Series using STHMM
[Figure: three adjacent edges e1, e2, and e3 and their hidden states over intervals t-1 (previous), t (current), and t+1 (next): s1 = cong, cong, clr; s2 = clr, clr, cong; s3 = cong, clr, clr.]
• TP for e1: $P(S_1^{t} \mid S_1^{t-1}, S_2^{t-1})$
• TP for e2: $P(S_2^{t} \mid S_1^{t-1}, S_2^{t-1}, S_3^{t-1})$
• TP for e3: $P(S_3^{t} \mid S_2^{t-1}, S_3^{t-1})$
• The transition probabilities capture the traffic dependency among edges.

Framework Overview
[Figure: framework overview. Travel-cost time series derived from historical GPS data feed state formulation and parameter learning, which produce the spatio-temporal hidden Markov model. The model and real-time GPS data yield near-future edge weights, which routing algorithms use to compute eco-routes and fastest routes.]

State Formulation – Intuitions
• Since edges are heterogeneous, we need to identify a
unique state set for each edge.
 Edges with heavy and complicated traffic: more states
 Edges with light and simple traffic: fewer states
• Target properties of the state set of an edge
 The traffic in the same state behaves similarly.
 The current state depends on its previous state.

State Formulation – Methods
• For each edge, identify a Gaussian Mixture Model (GMM)
with K Gaussian components to capture the distribution of
all the travel costs on the edge.
 Due to heterogeneity, different edges may have different Ks.
• Transform the travel costs in each interval in the edge’s
time series into a K+1 dimensional point.
 The first K dimensions correspond to the travel cost distribution in
the interval, still using the K Gaussian components.
 The K+1st dimension captures the difference between the travel
cost distributions in the interval and its previous interval.
• Cluster these K+1 dimensional points into several clusters,
and each cluster is a state.

Parameter Learning
• Output probabilities
 Since each state is a cluster, we use the cluster center to derive a
GMM as the output probability for the state.
• Transition probabilities
 For each edge, we need the edge’s travel-cost time series, along
with its adjacent edges’ travel-cost time series.
 Based on the obtained output probabilities, transition probabilities
are then estimated using maximum likelihood estimation.
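As a simplified illustration of the transition-probability step, the sketch below assumes every interval has already been assigned its most likely state, so that maximum likelihood estimation reduces to relative frequencies. In the actual model the states are hidden and the data is sparse, so this is not the authors' full procedure; names and sequences are made up.

```python
from collections import Counter, defaultdict

def learn_transitions(own, neighbours):
    """Estimate P(s_t | own s_{t-1}, neighbours' s_{t-1}) by relative frequency.

    own: list of states of the edge, one per interval.
    neighbours: list of state lists for adjacent edges, each as long as own.
    """
    counts = defaultdict(Counter)
    for t in range(1, len(own)):
        parents = (own[t - 1],) + tuple(seq[t - 1] for seq in neighbours)
        counts[parents][own[t]] += 1
    return {p: {s: c / sum(cnt.values()) for s, c in cnt.items()}
            for p, cnt in counts.items()}

# Illustrative state sequences for edge e1 and one adjacent edge e2.
s1 = ["cong", "cong", "clr", "clr", "cong", "cong", "clr"]
s2 = ["clr", "clr", "clr", "cong", "clr", "clr", "clr"]
tp = learn_transitions(s1, [s2])
```

The conditioning tuple includes the neighbours' previous states, which is what couples the time series of adjacent edges.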

Travel Cost Inference based on an STHMM
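[Figure: travel cost inference on edges e1, e2, and e3. Real-time GPS data on each edge is combined with the output probabilities (OP) and Bayes' rule to estimate the current states s1, s2, and s3 at interval t; the transition probabilities (TP) then give the estimated state of e2 at interval t+1, and the OP of that state yields the estimated travel costs.]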

Empirical Settings
• Data
 181 million GPS records from North Jutland, Denmark from April
2007 to March 2008.
 The data is split into training and testing sets.
• Periods of interest
 Weekdays during [6:00, 20:00).
• Road network
 1,916 edges in North Jutland with more than 500 records.
• Travel costs
 Travel time and GHG emissions.
• Accuracy measurements
 Average sum of squared loss (ASSL) between estimated costs and
real costs.

Experimental Study
• Aspects studied
 Effect of the length of interval.
 Effect of the sizes of training data
 Effect of recent data
 Effect of seasonality
• Results
 Efficient; able to support real-time weight updates
 Accurate; outperforms a baseline and the state-of-the-art

Summary
• Described an STHMM that is able to model multiple
correlated, sparse time series.
 Dependent: STHMM couples the states from adjacent edges.
 Heterogeneous: State formulation identifies a unique state set for
each edge.
 Sparse: Smoothing techniques are used to contend with sparse
data.
• The described framework is able to support real-time
weight updates and thus provide more accurate routing
services, e.g., eco-routing.

Stochastic Skyline Route Planning
Under Time-Varying Uncertainty

Outline
• Motivation and challenges
• Road network modeling
• Stochastic skyline route planning
• Empirical studies
• Summary

Motivation
• Reduction in greenhouse gas (GHG) emissions is
important to society as a whole.
• Eco-routing is a simple yet effective way to achieve the
reduction in the transportation sector.
 Of interest to both fleet owners and individual drivers.
 For example, FlexDanmark is interested in using the most eco-friendly
routes while considering travel times and distances.
• We aim to provide useful eco-routing services.

Challenges
• To provide practically useful eco-routing, we need to
contend with three challenges.
• Multiple travel costs
 Consider not only GHG emissions, but also distances and travel
times.
• Time dependence
 Some travel costs are time-dependent.
 For example, travel times on the same road may vary during peak
vs. off-peak periods.
• Uncertainty
 Some travel costs are uncertain.
 For example, aggressive and moderate driving on the same road
may produce different GHG emissions.

A novel road network model - MTUG
• MTUG is a Multi-cost, Time-dependent, Uncertain Graph.
• Assume N different costs are of interest in total.
 Distance (DI), travel time (TT), GHG emissions (GE).
• G=(V, E, MM, W)
 V is the vertex set, and E is the edge set.
 MM = <MM(1), … , MM(N)>
 Function MM(i) maps an edge to the minimum and maximum i-th
cost of using the edge.
 MM(TT) (ea) = (150 seconds, 500 seconds)
 MM(GE) (eb) = (10 ml, 85 ml)
 W = <W(1), … , W(N)>
 Function W(i) maps an edge to a set of (interval, random variable)
pairs of the i-th cost type.
 W(TT) (ea) = {([0:00, 7:15), N(300, 120)), ([7:15, 8:45), N(450, 100)), …}
 W(GE) (eb) = {([0:00, 7:00), N(30, 100)), ([7:00, 9:00), N(50, 80)), …}

Instantiation of MM in an MTUG
• MM and W are instantiated using GPS records.
• GPS records are map matched to edges.
• Each edge is associated with a set of edge records in the
form of (e, t, C).
 An edge record indicates that a traversal on edge e at time t takes
costs C, where C is a vector of all costs of interest.
 (e1, 8:08, <55 seconds, 80 ml>)
 (e1, 9:18, <45 seconds, 63 ml>)
 (e1, 10:10, <43 seconds, 60 ml>)
 (e1, 21:03, <45 seconds, 62 ml>)
• Based on the edge records on an edge, functions MM on
the edge can be instantiated.
 MM(TT) (e1) = (43 seconds, 55 seconds)
 MM(GE) (e1) = (60 ml, 80 ml)
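Instantiating MM is a running minimum and maximum per edge and cost type. The sketch below uses the four edge records from the slide; the tuple layout (travel time in seconds, GHG emissions in ml) and the function name are assumptions made for illustration.

```python
def instantiate_mm(edge_records):
    """Compute (min, max) per cost type for every edge.

    edge_records: iterable of (edge, time, costs), where costs is a tuple
    with one value per cost type, always in the same order.
    """
    mm = {}
    for edge, _t, costs in edge_records:
        if edge not in mm:
            mm[edge] = [(c, c) for c in costs]
        else:
            mm[edge] = [(min(lo, c), max(hi, c))
                        for (lo, hi), c in zip(mm[edge], costs)]
    return mm

# Edge records from the slide: (edge, time, <travel time in s, GHG emissions in ml>).
records = [("e1", "8:08", (55, 80)), ("e1", "9:18", (45, 63)),
           ("e1", "10:10", (43, 60)), ("e1", "21:03", (45, 62))]
mm = instantiate_mm(records)
```

The result for e1 agrees with MM(TT)(e1) and MM(GE)(e1) as stated on the slide.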

Instantiation of W in an MTUG
• Partition a day into 96 15-min intervals.
• For each (edge, interval) pair, we obtain a multi-set containing
the costs on the edge during the interval.
 ms={(10 s, 3), (20 s, 10), (25 s, 20), (30 s, 10), (40 s, 5), …}
• Estimate a random variable (RV) based on the multi-set.
 Use a Gaussian Mixture Model (GMM) to represent an RV.
 GMMs can approximate arbitrary distributions.
 A GMM is a weighted sum of K Gaussian distributions.
 $GMM(x) = \sum_{k=1}^{K} m_k \cdot N(x \mid \mu_k, \delta_k^2)$
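Evaluating a GMM density follows directly from the weighted-sum definition. The two components below are made-up values, not parameters fitted to the multi-set on the slide.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, components):
    """components: list of (m_k, mu_k, var_k) with the weights m_k summing to 1."""
    return sum(m * normal_pdf(x, mu, var) for m, mu, var in components)

# Illustrative two-component mixture for travel times in seconds.
gmm = [(0.7, 25.0, 16.0), (0.3, 40.0, 25.0)]

# Sanity check: a crude Riemann sum over [0, 100) should be close to 1.
area = sum(gmm_pdf(x * 0.1, gmm) * 0.1 for x in range(1000))
```

Because the weights sum to 1 and each component integrates to 1, the mixture is itself a valid density.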

Instantiation of W in an MTUG (cont.)
• If two RVs in two adjacent intervals are similar, we combine
the two intervals into a long interval.
 Use KL-divergence to measure the similarity between two RVs.
• Re-estimate a new RV for the long interval using the costs in
the long interval.
• The whole procedure works iteratively until no RVs from
consecutive intervals are similar enough to be combined.
• The long intervals along with their RVs instantiate W.
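The merging loop can be sketched as follows. To keep the example short, each interval's RV is summarized by a single Gaussian rather than a GMM, and similarity is a symmetrized KL divergence compared against a hypothetical threshold; both simplifications are assumptions, not the method from the slides.

```python
import math
from statistics import mean, pvariance

def fit(costs):
    """Fit a single Gaussian (mean, variance); the floor avoids zero variance."""
    return mean(costs), max(pvariance(costs), 1e-6)

def sym_kl(p, q):
    """Symmetrized KL divergence between two univariate Gaussians."""
    (m1, v1), (m2, v2) = p, q
    kl_pq = 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)
    kl_qp = 0.5 * (math.log(v1 / v2) + (v2 + (m1 - m2) ** 2) / v1 - 1)
    return kl_pq + kl_qp

def merge_intervals(intervals, threshold=0.5):
    """intervals: list of ((start, end), costs) in temporal order."""
    intervals = list(intervals)
    merged = True
    while merged:  # repeat until no adjacent pair is similar enough
        merged = False
        for i in range(len(intervals) - 1):
            (a, ca), (b, cb) = intervals[i], intervals[i + 1]
            if sym_kl(fit(ca), fit(cb)) < threshold:
                # Combine into one long interval; its RV is re-estimated from
                # the pooled costs the next time fit() is called on it.
                intervals[i:i + 2] = [((a[0], b[1]), ca + cb)]
                merged = True
                break
    return intervals

# Intervals given in minutes since midnight, with illustrative costs in seconds.
day = [((0, 15), [30, 32, 31, 29]), ((15, 30), [31, 30, 32, 30]),
       ((30, 45), [60, 62, 58, 61])]
result = merge_intervals(day)
```

The first two intervals have nearly identical cost distributions and are merged, while the third stays separate.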

Route Costs in MTUG
• Given a route Ri = <r1, r2, …, rX>, where each ri ∈ E is an edge.
• RC(Ri, t) indicates the costs of using route Ri at time t
 $RC(R_i, t) = \langle RV_{DI}, RV_{TT}, RV_{GE} \rangle$ is a vector of RVs, and each RV
corresponds to a travel cost.
• $RV_{DI}$ is a deterministic value, which equals the sum of
the lengths of the edges in route Ri.
• $RV_{TT}$ is the convolution of the travel time
RVs of the edges in route Ri.
 The travel time RV of the first edge r1 depends on the
trip start time t.
 The travel time RV of the k-th edge rk depends on the
travel time of the previous k-1 edges, which may be uncertain.
• $RV_{GE}$ is the convolution of the GHG
emission RVs of the edges in route Ri.
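For intuition, the convolution of independent discrete cost distributions can be written as below. The sketch deliberately ignores the time dependence described above (the RV of the k-th edge depending on the arrival time at that edge) and assumes independent edges; the distributions are illustrative.

```python
def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q.

    p, q: dicts mapping a cost value to its probability.
    """
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, 0.0) + px * qy
    return out

def route_cost(edge_rvs):
    """Convolve the per-edge distributions along a route."""
    total = {0: 1.0}  # zero cost with certainty before the first edge
    for rv in edge_rvs:
        total = convolve(total, rv)
    return total

# Illustrative travel-time distributions (seconds) of two consecutive edges.
r1 = {30: 0.5, 40: 0.5}
r2 = {50: 0.8, 70: 0.2}
tt = route_cost([r1, r2])
```

A time-dependent version would pick the RV of each edge based on the distribution of arrival times accumulated so far.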

Stochastic Dominance
• Stochastic Dominance between two RVs
 Given two RVs X and Y, if $cdf_X(a) \ge cdf_Y(a)$ for all possible values
a in R+, we say “X stochastically dominates Y”.
 $RC(R_1, t).RV_{TT}$ stochastically dominates $RC(R_2, t).RV_{TT}$.
 No stochastic dominance between $RC(R_3, t).RV_{TT}$ and
$RC(R_4, t).RV_{TT}$.
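The definition translates into a direct check when costs are discrete. Since both CDFs are step functions, it suffices to compare them at the union of the two supports. The distributions below are made up for illustration.

```python
def cdf(dist, a):
    """P(X <= a) for a discrete distribution given as {value: probability}."""
    return sum(p for x, p in dist.items() if x <= a)

def dominates(X, Y, eps=1e-12):
    """True if cdf_X(a) >= cdf_Y(a) at every support point of X and Y."""
    points = sorted(set(X) | set(Y))
    return all(cdf(X, a) + eps >= cdf(Y, a) for a in points)

# X is shifted toward lower costs relative to Y, so X dominates Y.
X = {80: 0.5, 90: 0.5}
Y = {90: 0.5, 100: 0.5}

# P and Q have crossing CDFs, so neither dominates the other.
P = {70: 0.5, 110: 0.5}
Q = {90: 1.0}
```

For costs, a larger CDF everywhere means a higher probability of staying below any given cost, which is why the dominating RV is the preferable one.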

Stochastic Skyline Routes
• Route Ri dominates route Rj iff
 No RV of RC(Rj, t) stochastically dominates the corresponding RV
of RC(Ri, t).
 At least one RV of RC(Ri, t) stochastically dominates the
corresponding RV of RC(Rj, t).
• Stochastic skyline route
 Given a source-destination pair, and a trip starting time.
 Stochastic skyline routes are the routes that are not dominated by
any other routes.
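A brute-force skyline over a handful of routes can be written literally from these two conditions. The sketch assumes discrete cost distributions and made-up routes; note that, with the non-strict definition of stochastic dominance, two identical RVs dominate each other, so a tie in any cost type prevents route dominance in this literal reading.

```python
def cdf(dist, a):
    return sum(p for x, p in dist.items() if x <= a)

def sd(X, Y, eps=1e-12):
    """X stochastically dominates Y (checked at all support points)."""
    return all(cdf(X, a) + eps >= cdf(Y, a) for a in sorted(set(X) | set(Y)))

def route_dominates(ci, cj):
    """ci, cj: lists of cost distributions, one per cost type, in the same order."""
    none_worse = not any(sd(y, x) for x, y in zip(ci, cj))
    some_better = any(sd(x, y) for x, y in zip(ci, cj))
    return none_worse and some_better

def skyline(routes):
    """Names of the routes that are not dominated by any other route."""
    return [name for name, c in routes.items()
            if not any(route_dominates(o, c)
                       for other, o in routes.items() if other != name)]

# Each route: [travel time distribution, GHG emission distribution] (illustrative).
routes = {
    "R1": [{80: 0.5, 90: 0.5}, {10: 1.0}],
    "R2": [{90: 0.5, 100: 0.5}, {12: 1.0}],
    "R3": [{100: 1.0}, {8: 1.0}],
}
result = skyline(routes)
```

R1 beats R2 on both cost types, while R3 is slower but emits less, so R1 and R3 are incomparable and both remain in the skyline.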

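Framework Overview
[Figure: framework overview. Offline phase: trajectories and the road network are pre-processed into cost records, which are used to instantiate MM and W and build the MTUG. Online phase: given a source, a destination, and a time, the routing module computes the stochastic skyline routes on the MTUG.]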

Stochastic Skyline Route Planning
• A brute force method
 Enumerate all possible routes, compute the route costs, and check
whether one route dominates another
 Very inefficient, and only works for small road networks
• An efficient method
 Prune some routes that cannot become skyline routes early.
 Efficient stochastic dominance checking

Early Pruning Strategy
• Do the following for all travel cost types of interest.
 We use travel time as an example.
 We maintain a graph where each edge is associated with the
minimum travel time, which is recorded in MM.
 From the destination, run Dijkstra’s algorithm on the graph.
 As each vertex is associated with the minimum travel time, we get the
“fastest” possible route from the source to the destination.
 Compute the route cost of the “fastest” possible route and add the
route as a candidate skyline route.
[Figure: road network from source vs to destination vd via v1, v2, and v3; each vertex vx is labeled with the shortest distance, least travel time, and least GHG emissions to the destination vd; candidate skyline route 1 is shown.]
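The per-vertex lower bounds come from one run of Dijkstra's algorithm on the reversed graph, started at the destination. The edge dictionary and the minimum travel times below are illustrative; vertex names follow the figure.

```python
import heapq

def lower_bounds(edges, destination):
    """Least possible cost from every vertex to the destination.

    edges: {(u, v): min_cost} for directed edges u -> v, e.g. taken from MM.
    """
    reverse = {}
    for (u, v), c in edges.items():
        reverse.setdefault(v, []).append((u, c))
    best = {destination: 0}
    heap = [(0, destination)]
    while heap:
        cost, v = heapq.heappop(heap)
        if cost > best.get(v, float("inf")):
            continue  # stale heap entry
        for u, c in reverse.get(v, []):
            if cost + c < best.get(u, float("inf")):
                best[u] = cost + c
                heapq.heappush(heap, (cost + c, u))
    return best

# Minimum travel times in seconds (made-up values).
edges = {("vs", "v1"): 150, ("vs", "v2"): 100, ("v1", "vd"): 100,
         ("v2", "v3"): 120, ("v3", "vd"): 90}
lb = lower_bounds(edges, "vd")
```

During exploration, the cost accumulated so far plus `lb[v]` at the current vertex gives an optimistic estimate for any completion of a partial route.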

Early Pruning Strategy (cont.)
• Explore routes from source, until no more routes can be
explored.
 Estimate the least possible travel costs for a partially explored
route R’.
 If the partially explored route with its estimated least possible costs
is dominated by an existing candidate skyline route, there is no
need to explore the route any further.
 Otherwise, continue exploring.
 Update candidate skyline route if necessary.
[Figure: road network with vertices vs, v1, v2, v3, and vd, showing candidate skyline route 3 and a partially explored route R’.]

Stochastic Dominance Checking
• Naïve way: check according to the definition of stochastic
dominance.
 For each value a, check whether cdfX(a) >= cdfY(a).
• An efficient way
 Consider one cost type at a time.
 Compute the minimum and maximum possible travel costs of a
route.
• Distinguish among three cases based on the min and max
travel costs of two routes.
 Disjoint case: dominance.

Stochastic Dominance Checking (cont.)
• Covered case: non-dominance.
• Overlapping case (needs further checking)
 Both dominance and non-dominance may occur.
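The three cases can be told apart from the (min, max) cost ranges alone. The sketch assumes the ranges are the exact supports of the two RVs and treats shared endpoints conservatively as overlapping; the boundary conventions are an assumption, since the slides show the cases only as figures.

```python
def classify(x, y):
    """Classify two (min, max) cost ranges of one cost type for two routes."""
    (x_lo, x_hi), (y_lo, y_hi) = x, y
    if x_hi <= y_lo or y_hi <= x_lo:
        return "disjoint"      # the route with the lower range dominates
    if (x_lo < y_lo and y_hi < x_hi) or (y_lo < x_lo and x_hi < y_hi):
        return "covered"       # one range strictly inside the other: no dominance
    return "overlapping"       # the CDFs must be compared to decide

case_a = classify((10, 20), (30, 40))
case_b = classify((10, 50), (20, 40))
case_c = classify((10, 30), (20, 40))
```

Only the overlapping case requires the more expensive CDF comparison, which is where the savings of the efficient check come from.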

Empirical Settings
• GPS Data
 More than 180 million GPS records collected from Denmark in
2007 and 2008.
 1 Hz sampling rate: one GPS record per second.
• Road networks (from OpenStreetMap)
 Aalborg (AA): 4,981 vertices and 12,614 edges.
 North Jutland (NJ): 68,318 vertices and 163,546 edges.
 Jutland (JU): 667,876 vertices and 1,622,974 edges.
 AA is the largest city in NJ, and NJ is part of JU.
• Travel costs
 Distances (DI): obtained from OpenStreetMap.
 Travel time (TT): computed from the timestamps in the GPS
records.
 GHG emissions (GE): computed from the instantaneous speeds
and accelerations from the GPS records.

MTUG Generation
• Hot edges: edges covered by the GPS data set.
• Instantiate MM and W based on the GPS data.
• Cardinalities of TT and GE intervals.
 Largest cardinalities: 65 for TT and 72 for GE.
 Average cardinalities:
• Cold edges: edges not covered by the GPS data set.
 Derive corresponding travel costs using speed limits.
The traffic in AA is much more
dynamic than that in other parts.

Stochastic Skyline Route Queries
• Query settings:
• 50 random source-destination pairs with a starting time in
each setting.
• Study the cardinality of the stochastic skyline routes and
the run time.

Cardinalities of Skyline Routes
• As the number of travel cost types considered increases,
the cardinality of the skyline routes also increases.
• When distance between source and destination increases,
the cardinality of the skyline routes also increases.

Run Time
• The run-time for a query is below 5 seconds, and most
queries are computed in less than 2 seconds.
• The advanced dominance checking method is effective and
efficient.
 The benefit of using the advanced dominance checking is more
substantial on JU since long routes require more dominance
checks.

Summary
• Described a framework that enables stochastic skyline
route planning in road networks with multiple, time-
dependent, and uncertain travel costs.
• Enables eco-routing in a realistic setting.

Vehicle Routing with
User-Generated Trajectory Data

Motivating Example
Fastest route: Route #1
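[Figure: road map from source S to destination D with intermediate points A, B, E, F, G, H, I, J, and K, a shopping mall, a food store, and labels 50 and 70; Route #1 is highlighted.]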

Motivating Example
Shortest route: Route #2
[Figure: road map from source S to destination D with the shopping mall and the food store; Route #2 is highlighted.]

Motivating Example
Route #3 avoids parking lots
[Figure: road map from source S to destination D with the shopping mall and the food store; Route #3 is highlighted.]

Motivating Example
Route #4 is attractive when stores are closed
[Figure: road map from source S to destination D with the shopping mall and the food store; Route #4 is highlighted.]

Motivating Example
Which route from source S to destination D is preferable?
[Figure: road map from source S to destination D with the shopping mall and the food store, showing Routes #1, #2, #3, and #4.]
The fastest – takes the least travel time?
The shortest – smallest distance?
The most “direct” – fewest turns?
The one most constrained to main roads – better road quality?
The “safest,” avoiding bad neighborhoods?
…
The one most preferred by locals?

Introduction
• Local travel
 Knowledge of the surroundings
 Follow familiar routes
• Travel in unfamiliar surroundings to unknown destinations
 Depend on available routing services
 Expect that the provided route is the best
• Idea: Use GPS data to let those who travel in unfamiliar
surroundings benefit from the insights of local travelers

Motivation
• Can’t we just use a standard routing service?
• How often do drivers follow routes recommended by
a routing service?
V. Čeikutė and C.S. Jensen. Routing service quality—local driver behavior versus routing services. MDM 2013, pp. 97–106
[Figure: histogram of the number of routes by route length (km; bins 0.5-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-220); 5000 routes in total.]

Motivation
• How often do drivers follow routes recommended by
a routing service?
• Breakdown by distance traveled
[Figure: average match (%) by route length (km). Bar labels in bin order: 0.5-10: 73%, 10-20: 56%, 20-30: 44%, 30-40: 36%, 40-50: 19%, 50-60: 9%, 60-70: 18%, 70-80: 20%, 80-220: 11%.]
V. Čeikutė and C.S. Jensen. Routing service quality—local driver behavior versus routing services. MDM 2013, pp. 97–106
• ~60% of routes match the
route provided by the routing
service; ~40% do not!
• Use of previous users’ trips
can improve routing quality

Goal of the Study
• Propose a routing framework that
 Utilizes GPS data volunteered by local drivers
 Exploits possibly hard-to-formalize insight into local conditions
 Takes into account temporal variation in driver behavior
 Recommends routes based on popularity and temporal aspects
• Evaluate the quality of proposed routes
 Study based on trip length and pre-selected drivers
 Quality comparison with existing routing service and route
recommendation approaches

Outline
• Introduction and motivation
• Framework
• Data preparation methodology
• Scoring of routes
• Empirical study
 Data foundation
 Routing quality evaluation and comparison
• Related work
• Conclusion and research directions

Framework
[Figure: framework. Offline components: GPS logs, map-matching, trajectory DB. Online components: the user's input data (source, destination, time), the routing service, which uses the trajectory data, the top route returned to the user, and the user's GPS trace.]
• Trips used:
 Start or end at, or go through, the source and destination locations
 Trips that start during the issued time period are preferred
• When the source and destination are not covered by available GPS data,
an existing routing service is used.

Data Preparation Methodology
• A trip is a sequence of
timestamped GPS points
between stay points.
• Stay point
 A location where the user stays
for a certain period of time (no
GPS data is received)
 A location reported repeatedly
for a certain period of time (the
engine remained on)
• In practice, trips are
consolidated
 Short trips are dropped and
some consecutive trips are
merged
[Figure: a trajectory plotted in x, y, and t, with repeated points at the same location marking a stay point.]

• Trips are map-matched and
transformed into sequences of
road segments with entry and
exit timestamps
• Example
 Trip $tr_1$ is map-matched to a
sequence of road segments:
$(s_1, t_{s_1}, t_{e_1}), (s_2, t_{s_2}, t_{e_2}), (s_3, t_{s_3}, t_{e_3})$
[Figure: trips $tr_1$, $tr_2$, and $tr_3$ on road segments $s_1$, $s_2$, and $s_3$.]

Do trips follow the same route?
• The similarity of two trips is determined by applying
Longest Common Subsequence (LCSS) to the trajectories
represented as sequences of road segments.
• Similar trips belong to the same route.
• Similarity of two trajectories:
 $S_1(pls, pls') = \frac{\mathit{length}(LCSS(pls, pls'))}{\mathit{length}(pls)} \cdot 100\%$
 $C_1(pls, pls') = \begin{cases} \mathit{true} & \text{if } S_1(pls, pls') \ge \mathit{getMatch}(pls, pls') \\ \mathit{false} & \text{otherwise} \end{cases}$
• The similarity threshold depends on the trip length:
longer trips get higher similarity thresholds, in the interval [90, 100].
[Figure: example trips drawn as sequences of road segments s1 to s11.]
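The similarity test can be sketched with a standard dynamic-programming LCSS over segment identifiers. Two assumptions are made here: length() counts road segments rather than kilometers, and a fixed threshold of 90 stands in for getMatch, which on the slide depends on the trip length.

```python
def lcss_length(a, b):
    """Length of the longest common subsequence of two segment sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity(pls, other):
    """S1: share of pls (in %) that is covered by the LCSS with other."""
    return 100.0 * lcss_length(pls, other) / len(pls)

def same_route(pls, other, threshold=90.0):
    """C1 with a fixed, hypothetical threshold instead of getMatch."""
    return similarity(pls, other) >= threshold

a = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10"]
b = ["s1", "s2", "s3", "s4", "s11", "s6", "s7", "s8", "s9", "s10"]  # one detour
c = ["s1", "s2", "s11", "s9", "s10"]                                # shortcut
```

Note that S1 is normalized by the length of the first argument only, so the measure is not symmetric.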

• Trips that follow the same sequence of road segments are
grouped into route usage objects.
user   route                      # traversals (peak)   # traversals (off-peak)
u1     r1: sBC, sCD, sDF, sFG     -                     10
u2     r2: sCD, sDF, sFG          3                     5
u3     r3: sED, sDF, sFI, sIG     7                     4
u4     r2: sCD, sDF, sFG          5                     6
[Figure: road network with vertices A, B, C, D, E, F, G, and I, showing routes r1, r2, and r3.]
• Route r2 is taken by users u2 and u4: 3 and 5 times during
peak hours and 5 and 6 times during off-peak hours, respectively.
 (r2, pl, (peak, (u2, 3), (u4, 5)), (off, (u2, 5), (u4, 6)))

Scoring of Routes
Preferred routes
• Popular among drivers
• Taken by many distinct drivers
• Popular at the time of day
and day of the week of the query
[Figure: routes between a source and a destination (points A and B), with peak-hour and off-peak-hour usage marked.]

Scoring of Routes
• Route preference value:
 $\mathit{pref}(r) = \alpha \cdot \mathit{users}(r) + (1 - \alpha) \cdot \mathit{traversals}(r)$
 users(r): # distinct drivers taking the route
 traversals(r): # of traversals of the route
• Final route score:
 $\mathit{score}(r) = \beta \cdot \mathit{pref}_M(r) + (1 - \beta) \cdot \mathit{pref}_N(r)$
 $\mathit{pref}_M$ considers trips taken during the query temporal pattern
 $\mathit{pref}_N$ considers trips taken during other temporal patterns
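The scoring can be tried out on the route usage numbers for r2 and r3. The sketch uses raw counts for users(r) and traversals(r) and picks α = 0.5 and β = 0.8 arbitrarily; whether the counts are normalized and how the parameters are set is not stated on the slides, so treat these as assumptions.

```python
def pref(usage, alpha):
    """usage: {driver: number of traversals} for one set of temporal patterns."""
    return alpha * len(usage) + (1 - alpha) * sum(usage.values())

def score(route_usage, query_pattern, alpha=0.5, beta=0.8):
    """Combine preference in the query pattern with preference in all others."""
    pref_m = pref(route_usage.get(query_pattern, {}), alpha)
    other = {}
    for pattern, usage in route_usage.items():
        if pattern != query_pattern:
            for driver, n in usage.items():
                other[driver] = other.get(driver, 0) + n
    pref_n = pref(other, alpha)
    return beta * pref_m + (1 - beta) * pref_n

# Route usage from the slides: r2 is used by u2 and u4, r3 by u3.
r2 = {"peak": {"u2": 3, "u4": 5}, "off": {"u2": 5, "u4": 6}}
r3 = {"peak": {"u3": 7}, "off": {"u3": 4}}
score_r2 = score(r2, "peak")
score_r3 = score(r3, "peak")
```

For a peak-hour query, r2 scores higher than r3 under these settings because it is used by two distinct drivers and has more traversals overall.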

Empirical Study: Data
• Monitoring period: 2 years
• Number of drivers: 285
• Number of GPS points (raw data): ~182,700,000
• Number of trips: ~275,000
• “Pay as You Speed” project (http://www.sparpaafarten.dk)
[Figure: number of traversals by traversal length (km; bins 2-10, 10-20, 20-30, 30-40, 40-50, 50-350), broken down into morning peak, afternoon peak, and off-peak.]

Routing Quality Evaluation: Data
For this study, we randomly selected equal numbers of trips
for different trip length intervals.
[Figure: number of traversals by traversal length (km; bins 2-10, 10-20, 20-350), broken down into off-peak, morning peak, and afternoon peak.]

Routing Quality: Match
A match is identified using LCSS.
[Figure: evaluation workflow. For each trajectory in the test set of the trajectory DB, its source and destination are sent to the routing service; the returned preferred route is compared with the trajectory, and each match adds to the total number of matches.]

Routing Quality Evaluation: Results
[Figure: three bar charts of matched vs. unmatched routes (%) by route length (km; bins 2-10, 10-20, 20-350), for our proposal, Google Directions API (top-1), and TPMFP (SIGMOD’13); one chart additionally has an “Empty” category.]

Routing Quality Evaluation: Data
For this study, we considered the five drivers with the most
trips.
[Figure: two bar charts of the number of traversals for drivers 1 to 5. Traversals per Length: 2-10 km, 10-30 km, 30-300 km. Traversals per Temporal Pattern: off-peak, afternoon peak, morning peak.]

Routing Quality Evaluation: Results
[Figure: two bar charts of results for drivers 1 to 5. Our proposal: share of matched, unmatched, and empty results for different types of route calculation (source and dest., source, destination, subroutes). Google Directions API (top-1): share of matched and unmatched routes (%).]

Related Work
[1] K.-P. Chang, L.-Y. Wei, M.-Y. Yeh, and W.-C. Peng. Discovering personalized routes from trajectories. LBSN 2011, pp. 33–40
[2] Z. Chen, H. T. Shen, and X. Zhou. Discovering popular routes from trajectories. ICDE 2011, pp. 900–911
[3] W. Luo, H. Tan, L. Chen, and L.M. Ni. Finding time period-based most frequent path in big trajectory data. In SIGMOD 2013,
pp. 713–724
[4] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang. T-Drive: Driving directions based on taxi trajectories. In GIS
2010, pp. 99–108
• Four existing routing techniques use drivers’ trajectories
[1],[2],[3],[4]
 The road network is formed from the road segments that are
covered by the trajectory data set
 A route is formed by
 Prioritizing parts of roads that are followed the most by a specific
driver (personalized routes) [1]
 Prioritizing parts of roads that are taken by other drivers [2],[4]
 Possibly using sub-routes from multiple routes [3]
 Trajectories used for scoring must contain the destination and
must start and end during the provided time interval. [3]
 Suggested routes are formed from the most popular routes or
route parts in the available data set.

Conclusion and Research Directions
Conclusions
• The proposed framework utilizes trajectory data collected from local
drivers for routing.
• A preferred route is selected using a flexible scoring function that
considers
 The number of traversals of the route
 The number of distinct drivers taking the route
 The time periods when the traversals occurred
• Use of travel histories of local drivers can increase routing quality
• More details in the paper!
Research directions
• Additional aspects of the framework can be considered
 Efficiency of route identification process (LCSS technique)
 Inclusion of personalized routes
 Better support for routes that are constructed from sub-routes

Demos and Prototype Systems
• EcoTour: http://daisy.aau.dk/its/
 Computes and compares the shortest, the fastest, and the most
eco-friendly routes for arbitrary source-destination pairs in DK.
 Best demo award at IEEE MDM 2013.
• EcoSky: http://daisy.aau.dk/its/eco/
 Supports skyline eco-routing and personalized eco-routing
• Sheafs: http://daisy.aau.dk/its/sheaf
 Trajectory based traffic sheafs
• Strict-Path Queries: http://daisy.aau.dk/its/spqdemo
 Trajectory based, Strict-Path Queries
 Trips (historical travel-time), route choice, Napoleon (road usage)
• iPark: identifying parking spaces from GPS trajectories
 On-street parking lanes vs. parking zones

The Future
• Much more data
 Inductive loop detectors
 Bus data
 Rejsekortet
• Many more connected vehicles
• New services
 Routing
 Safety and warnings
 Parking, fees, insurance, road pricing
 Car sharing, multi-modality
• Driver-less vehicles

Challenges
• Detect “black spots” before they occur.
• Modeling spatio-temporal congestion from data
• Characterize the effects of events
 Accidents, malfunctioning of traffic signals, rain, a concert
• Real-time traffic management
 In response to current or predicted situation, actuate traffic signals
and drivers (via their smartphones or navigation devices) to
optimize the use of the infrastructure and driver experience
• Automated trade-off between weight level of detail and
available data.
• Stochastic routing at 20 milliseconds.
• Integrate with “point” data.

Acknowledgments
• Colleagues at Aalborg University, Aarhus University, and
beyond.
• The EU FP7 project, Reduction: http://www.reduction-
project.eu/
• The Obel Family Foundation: http://www.obel.com/en

Readings
• V. Ceikute, C. S. Jensen: Vehicle Routing with User-Generated
Trajectory Data. MDM (1) 2015
• C. Guo, B. Yang, O. Andersen, C. S. Jensen, K. Torp: EcoMark 2.0:
empowering eco-routing with vehicular environmental models and
actual vehicle fuel consumption data. GeoInformatica 19(3):567-599
(2015)
• C. Guo, B. Yang, O. Andersen, C. S. Jensen, K. Torp: EcoSky:
Reducing vehicular environmental impact through eco-routing. ICDE
2015:1412-1415
• B. Yang, C. Guo, Y. Ma, C. S. Jensen: Toward personalized, context-
aware routing. VLDB J. 24(2):297-318 (2015)
• Y. Ma, B. Yang, C. S. Jensen: Enabling Time-Dependent Uncertain
Eco-Weights For Road Networks. GeoRich@SIGMOD 2014:1:1-1:6
• B. Yang, C. Guo, C. S. Jensen, M. Kaul, S. Shang: Stochastic skyline
route planning under time-varying uncertainty. ICDE 2014:136-147
• B. Yang, M. Kaul, C. S. Jensen: Using Incomplete Information for
Complete Weight Annotation of Road Networks. IEEE TKDE
26(5):1267-1279 (2014)
• C. Guo, C. S. Jensen, B. Yang: Towards Total Traffic Awareness.
SIGMOD Record 43(3):18-23 (2014)

Readings
• V. Ceikute, C. S. Jensen: Routing Service Quality - Local Driver Behavior Versus
Routing Services. MDM (1) 2013: 97-106
• M. Kaul, B. Yang, C. S. Jensen: Building Accurate 3D Spatial Networks to Enable
Next Generation Intelligent Transportation Systems. MDM 2013: 137-146, best
paper award.
• Ove Andersen, Christian S. Jensen, Kristian Torp, Bin Yang: EcoTour: Reducing
the Environmental Footprint of Vehicles Using Eco-routes. MDM 2013: 338-340,
Demo paper, best demo award.
• B. Yang, C. Guo, C. S. Jensen: Travel Cost Inference from Sparse, Spatio-
Temporally Correlated Time Series Using Markov Models. PVLDB 6(9): 769-780
(2013)
• C. Guo, Y. Ma, B. Yang, C. S. Jensen, Manohar Kaul: EcoMark: evaluating models
of vehicular environmental impact. SIGSPATIAL/GIS 2012: 269-278
• L. Speicys, C. S. Jensen: Enabling Location-based Services - Multi-Graph
Representation of Transportation Networks. GeoInformatica 12(2): 219-253 (2008)
• A. Brilingaite, C. S. Jensen: Enabling Routes of Road Network Constrained
Movements as Mobile Service Context. GeoInformatica 11(1): 55-102 (2007)
• C. Hage, C. S. Jensen, T. B. Pedersen, L. Speicys, I.Timko: Integrated Data
Management for Mobile Services in the Real World. VLDB 2003: 1019-1030
