Integrating Sensor and Social Data for Understanding City Events
1. Semantic Approach to
Big Data and Event Processing
Integrating Sensor and Social Data
for Understanding City Events
Pramod Anantharam
Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis)
Wright State University, USA
Tutorial @ Kno.e.sis Centre: Semantics Approach to Big Data and Event Processing, Oct 7-9, 2015
4. • Why?
– Provides complementary information for
comprehensive situational awareness
• Sensor : Social :: Quantitative : Qualitative
– Corroboration can further improve trustworthiness
• What?
– Collect and relate multimodal sensor data and social
media data
• How?
– Correlate heterogeneous data streams exploiting
spatio-temporal proximity and domain knowledge
T. K. Prasad 4
Multimodal Data Integration
5. • Why?
– Explain/Interpret average speed and link travel time
data using event schedule provided by city authorities
and real-time traffic events shared on Twitter
– Past work: Predict congestion based on historical
sensor data
• What?
– Combine
• 511.org data about Bay Area Road Network Traffic
– E.g., Average speed and link travel time data stream
– E.g., (Happened or planned) event reports
• Tweets that report events including ad hoc ones
Traffic Domain Use Case (open data)
6. • How?
– Extract events from textual tweets stream
– Build statistical models of normalcy, and thereby
anomaly, from numerical sensor data streams
– Correlate multimodal streams, using spatio-
temporal information, to annotate “anomalies” in
sensor data time series with textual events
Traffic Domain Use Case (open data)
9. Some Challenges in Extracting Events from Tweets
• No well accepted definition of ‘events related to a
city’
• Tweets are short (140 characters), and their informal
nature makes them hard to analyze
– Entity, location, time, and type of the event
• Multiple reports of the same event and sparse
report of some events (biased sample)
– Numbers don’t necessarily indicate intensity
• Validation of the solution is hard due to the open
domain nature of the problem
10. Formal Text Informal Text
Closed Domain
Open Domain [Roitman et al. 2012][Kumaran and Allan 2004]
[Lampos and Cristianini 2012]
[Becker et al. 2011]
[Wang et al. 2012]
[Ritter et al. 2012]
Related Work on Event Extraction
11.
[ABTA-14] Pramod Anantharam, Payam Barnaghi, Krishnaprasad Thirunarayan, and Amit Sheth. 2015. Extracting City Traffic Events from Social Streams.
ACM Trans. Intell. Syst. Technol. 6, 4, Article 43 (July 2015), 27 pages. DOI=10.1145/2717317 http://doi.acm.org/10.1145/2717317
City Event Extraction from Textual Data
12. • City Event Annotation
– Automated creation of training data
– Annotation task (our CRF model vs. baseline CRF
model)
• City Event Extraction
– Use aggregation algorithm for event extraction
– Extracted events AND ground truth
• Dataset (Aug – Nov 2013) ~ 8 GB of data on disk
– Over 8 million tweets
– Over 162 million sensor data points
– 311 active events and 170 scheduled events
Evaluation
13.
Evaluation Metric For Comparing Events with Ground Truth:
• Complementary Events
• Additional information e.g., slow traffic from sensor data and accident from
textual data
• Corroborative Events
• Additional confidence e.g., an accident event supporting an accident report from
ground truth
• Timeliness
• Early detection e.g., knowing poor visibility before its formal report
Distribution of Extracted Events Over Locations
17. • How?
– Extract events from textual tweets stream
– Build statistical models of normalcy, and thereby
anomaly, from numerical sensor data streams
– Correlate multimodal streams, using spatio-
temporal information, to annotate “anomalies” in
sensor data time series with textual events
Traffic Domain Use Case (open data)
18. Image credit: http://traffic.511.org/index
Multiple events
Varying influence
interact with each other
Focus of this talk: algorithms to understand these manifestations
Correlating Multimodal Streams: Preliminary Insights
19. • Causes of non-linearity in sensor data
streams
– Temporal landmarks : peak hour vs off-peak traffic
vs weekend traffic
– Effect of location
– Scheduled events such as road construction,
baseball game, or music concert
– Unexpected events such as accidents or heavy
rains
– Random variations (viz., stochasticity)
Traffic Dependencies
20. • Disclaimer
“All models are wrong, but some are useful.” - George Box
• Normalcy Model
– Gaussian Mixture Model (GMM)
• Captures multiple co-existing events and their impact on traffic
– Auto Regressive (AR) Models
• Captures temporal dependencies in traffic dynamics
– Restricted Switching Linear Dynamical System
• Exploits Domain Common Sense for Stationarity
• One LDS model per road link per week hour (24 hr x 7 days / week
=> 168 models)
• Anomaly Model
– Cf. Box and Whisker plots
Abstracting Traffic Behavior: Traffic Data Model
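The "168 models" count above follows from one model per hour of the week. A minimal sketch of that indexing is given here; the helper name `week_hour` is hypothetical, and treating Monday 00:00 as slot 0 is an assumption (the slides do not say which day starts the week).

```python
from datetime import datetime

HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7
NUM_WEEK_HOURS = HOURS_PER_DAY * DAYS_PER_WEEK  # 168 models per road link


def week_hour(ts: datetime) -> int:
    """Map a timestamp to its hour-of-week slot in [0, 167].

    Assumption: Monday 00:00-00:59 is slot 0 (datetime.weekday() convention).
    """
    return ts.weekday() * HOURS_PER_DAY + ts.hour


# 2014-06-02 was a Monday, so 08:15 falls in slot 8.
monday_rush = week_hour(datetime(2014, 6, 2, 8, 15))
# 2014-06-08 was a Sunday, so 23:59 falls in the last slot, 167.
sunday_late = week_hour(datetime(2014, 6, 8, 23, 59))
```

A per-link dictionary keyed by this slot would then hold the model learned for that link and hour of the week.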
22. Histogram of speed values
collected from June 1st 12:00 AM to June 2nd 12:00 AM
Histogram of travel time values
collected from June 1st 12:00 AM to June 2nd 12:00 AM
Traffic Data: First Peek
23. Most of the drivers tend to
go 5 km/h over the posted speed limit
There are relatively fewer drivers who
go more than 10 km/h over the
posted speed limit
There are situations in a day where
drivers are forced to go below the
speed limit, e.g., rush hour traffic
Do these histograms resemble any probability distribution?
Traffic Data: Possible Explanation
24. “many variables such as height, weight, IQ scores, reading ability, job satisfaction,
blood pressure turn out to have distributions that are bell-shaped or normal.”2
Popularized by Gauss in 1809, when he used it to analyze astronomical data; hence it is now
popularly known as the Gaussian Distribution.
http://en.wikipedia.org/wiki/Normal_distribution
2http://peoplelearn.homestead.com/Topic3NORMAL1.html
P(x) = G(μ, σ²)
Gaussian Distribution
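For reference, the density behind G(μ, σ²) can be written out directly. This is a small sketch only: the function name `gaussian_pdf` is hypothetical, and the speed values are illustrative numbers, not taken from the 511.org data.

```python
import math


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of the normal distribution G(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# Illustrative numbers only: speeds centred on 105 km/h with a
# standard deviation of 5 km/h.
peak = gaussian_pdf(105.0, mu=105.0, sigma=5.0)       # density at the mean
one_sigma = gaussian_pdf(110.0, mu=105.0, sigma=5.0)  # one std. dev. away
```

The density one standard deviation from the mean is exp(-1/2) times the peak value, regardless of μ and σ.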
26. Assume Normalcy to be uninterrupted traffic flow
July 2014 has no events, so we
hypothesize a higher log-likelihood
score
June 2014 has many events, so we
hypothesize a lower log-likelihood
score
-115655.8
-125974.3
Golden Gate Fields: Comparing Months with Varying Event Occurrences
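The comparison above rests on scoring data against a model of normalcy: data that resembles normal traffic should score a higher log-likelihood. The sketch below illustrates the idea with a single Gaussian fitted to synthetic speeds. That is a deliberate simplification and an assumption; the talk's normalcy models (GMM, LDS) are richer, and none of these numbers come from the talk's data.

```python
import math
from statistics import mean, pstdev


def log_likelihood(samples, mu, sigma):
    """Sum of Gaussian log-densities of the samples under G(mu, sigma^2)."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - (x - mu) ** 2 / (2.0 * sigma ** 2) for x in samples)


# Synthetic speeds in km/h (hypothetical, for illustration only).
training = [100, 104, 106, 105, 103, 107, 105, 102]  # event-free history
quiet_month = [104, 105, 103, 106, 104]              # resembles normalcy
eventful_month = [104, 60, 55, 105, 70]              # slowdowns from events

# Fit the normalcy model on event-free data, then score both months.
mu, sigma = mean(training), pstdev(training)
ll_quiet = log_likelihood(quiet_month, mu, sigma)
ll_eventful = log_likelihood(eventful_month, mu, sigma)
```

Under this toy model the quiet month scores higher than the eventful one, which is the direction of the hypothesis stated on the slide.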
28. • Differentiate various traffic dynamics
– Gaussian mixture model is too coarse-grained, as it does not discriminate
between increasing traffic over an hour and decreasing traffic over the
same hour.
• Account for unobserved factors
– Autoregressive models cannot capture unobserved factors
• E.g., Traffic volume, which may be unobserved, dictates the manifestation of events in link
speed and travel time variations.
– Linear Dynamical System introduces latent state-based model
• E.g., Traffic volume (low vs high), road lane closures, and weather conditions (visibility) can
impact how observations evolve.
• Emission/Transition matrices and Gaussian noise capture stochasticity.
Modeling Traffic Dynamics: Statistical Models and Intuitions
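To make the latent-state idea concrete, here is a minimal simulation of a scalar linear dynamical system. It is a sketch under stated assumptions: the real model would use vector states with transition and emission matrices, and the names `simulate_lds`, `a`, `c`, `q_std`, and `r_std` are hypothetical.

```python
import random


def simulate_lds(a, c, x0, q_std, r_std, steps, seed=0):
    """Simulate a scalar linear dynamical system.

    latent:   x_t = a * x_{t-1} + w_t,  w_t ~ N(0, q_std^2)  (transition)
    observed: y_t = c * x_t + v_t,      v_t ~ N(0, r_std^2)  (emission)
    """
    rng = random.Random(seed)
    x = x0
    latents, observations = [], []
    for _ in range(steps):
        x = a * x + rng.gauss(0.0, q_std)  # hidden state evolves linearly
        y = c * x + rng.gauss(0.0, r_std)  # observation follows the state
        latents.append(x)
        observations.append(y)
    return latents, observations


# With both noise terms switched off the system is deterministic:
# the latent state halves at each step and the observation is twice the state.
latents, observations = simulate_lds(a=0.5, c=2.0, x0=8.0,
                                     q_std=0.0, r_std=0.0, steps=3)
```

Setting `q_std` and `r_std` above zero adds the Gaussian noise that accounts for stochasticity in the observed speeds and travel times.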
29. • Characterize data time series (by learning the
distribution of behavior at each time point using
mean and variance)
• Pick a realizable medoid time series as prototype
for comparison, summarized using LDS parameters
Linear Dynamical System Model
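Picking a medoid, rather than a point-wise mean, guarantees the prototype is a time series that was actually observed. A minimal sketch follows; the use of Euclidean distance is an assumption (the slides do not name a distance), and the traces are synthetic.

```python
def medoid(series_list):
    """Return the series with the smallest total distance to all others.

    Unlike a point-wise mean, the medoid is one of the observed series,
    so it is a realizable trajectory.
    """
    def dist(a, b):
        # Euclidean distance between two equal-length series (assumed metric).
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    return min(series_list,
               key=lambda s: sum(dist(s, other) for other in series_list))


# Three synthetic speed traces (km/h); the middle one is the most central.
traces = [
    [100, 98, 95, 97],
    [102, 100, 97, 99],
    [110, 108, 105, 107],
]
prototype = medoid(traces)
```

Here the second trace is selected because its summed distance to the other two is the smallest.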
32. • Normalcy : Log Likelihood scores of traces from event free data visualized
as box and whiskers plot
– Intertwined with long-term construction event influence
• Anomaly : Log Likelihood score falls beyond whiskers threshold for
eventful data
Log-likelihood
score
Tagging Anomalies: Intuitions
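The whisker test can be sketched as follows. The conventional Tukey rule (whiskers at 1.5 times the interquartile range beyond the quartiles) is assumed here, since the slides do not give the exact threshold, and the scores are synthetic.

```python
from statistics import quantiles


def whisker_bounds(scores, k=1.5):
    """Lower/upper whisker limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Synthetic log-likelihood scores of traces from event-free data.
normal_scores = [-100.0, -102.0, -98.0, -101.0, -99.0, -103.0, -97.0]
low, high = whisker_bounds(normal_scores)


def is_anomaly(score):
    """A score is tagged anomalous if it falls beyond either whisker."""
    return score < low or score > high
```

In practice only unusually low log-likelihood scores are likely to matter for traffic, but the sketch checks both whiskers for completeness.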
33. • How?
– Extract events from textual tweets stream
– Build statistical models of normalcy, and thereby
anomaly, from numerical sensor data streams
– Correlate multimodal streams, using spatio-
temporal information, to annotate “anomalies” in
sensor data time series with textual events
Traffic Domain Use Case (open data)
34. • If an anomaly is detected on a link L during the time
period [t_st, t_et], then the anomaly is explained by an
event if the event occurred within a 0.5 km radius of the
link and during [t_st - 1, t_et + 1].
• CAVEAT: An anomaly may not be explained because
of missing data.
Spatio-temporal co-occurrence criteria
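The co-occurrence criterion can be sketched as a distance check plus a widened time window. Several things here are assumptions: the link is represented by a single (lat, lon) point, distance is great-circle (haversine), the time slack is one hour (the slide does not state the unit of the ±1), and the dictionary keys and coordinates are hypothetical.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def explains(anomaly, event, radius_km=0.5, slack=timedelta(hours=1)):
    """True if the event is within radius_km of the anomaly's location and
    inside [start - slack, end + slack]. The one-hour slack is an assumption."""
    near = haversine_km(anomaly["lat"], anomaly["lon"],
                        event["lat"], event["lon"]) <= radius_km
    in_window = (anomaly["start"] - slack
                 <= event["time"]
                 <= anomaly["end"] + slack)
    return near and in_window


# Hypothetical anomaly and events (coordinates are illustrative only).
anomaly = {"lat": 37.8850, "lon": -122.3100,
           "start": datetime(2014, 6, 1, 17, 0),
           "end": datetime(2014, 6, 1, 18, 0)}
nearby_event = {"lat": 37.8870, "lon": -122.3100,
                "time": datetime(2014, 6, 1, 16, 30)}   # ~0.22 km away
far_event = {"lat": 37.8950, "lon": -122.3100,
             "time": datetime(2014, 6, 1, 17, 30)}      # ~1.1 km away
late_event = {"lat": 37.8870, "lon": -122.3100,
              "time": datetime(2014, 6, 1, 20, 0)}      # outside the window
```

Only `nearby_event` satisfies both conditions; the other two fail on distance and on time respectively.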
35. • Data collected from San Francisco Bay Area between May 2014 and May
2015
– 511.org:
• 1,638 traffic incident reports
• 1.4 billion speed and travel time observations
– Twitter Data: 39,208 traffic-related incidents extracted from over 20 million
tweets1
• Naïve implementation for learning normalcy models for 2,534 links
resulted in 40 minutes per link (~ 2 months of processing time for our
data)
– 2.66 GHz, Intel Core 2 Duo with 8 GB main memory
• Scalable implementation by exploiting the nature of the problem resulted
in learning normalcy models within 24 hours
– The Apache Spark cluster used in our evaluation has 864 cores and 17TB main
memory.
1Anantharam, P. 2014. Extracting city traffic events from social streams. https://osf.io/b4q2t/wiki/home/
Experimental Data Statistics And Infrastructure
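The "~ 2 months" estimate for the naive implementation can be checked with simple arithmetic, assuming the links are processed one after another on a single machine:

```python
LINKS = 2534
MINUTES_PER_LINK = 40

total_minutes = LINKS * MINUTES_PER_LINK     # 101,360 minutes in total
total_days = total_minutes / (60 * 24)       # a little over 70 days

# Finishing within 24 hours therefore implies at least this overall speedup.
min_speedup = total_days / 1.0
```

A little over 70 days of sequential processing is consistent with the slide's rough two-month figure.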
Point of this slide: heterogeneity and uncertainty
Improve coverage
Past work for comparison: variation in bike hires based on events in the city (e.g., parades, sports, bad weather)
Annotate sensor data stream with tweets : timelines / Google trends
[Kumaran and Allan 2004] Giridhar Kumaran and James Allan. 2004. Text classification and named entities for new event detection. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 297–304.
[Lampos and Cristianini 2012] Vasileios Lampos and Nello Cristianini. 2012. Nowcasting events from the social web with statistical learning. ACM Transactions on Intelligent Systems and Technology (TIST) 3, 4 (2012), 72.
[Roitman et al. 2012] Haggai Roitman, Jonathan Mamou, Sameep Mehta, Aharon Satt, and LV Subramaniam. 2012. Harnessing the Crowds for smart city sensing. In Proceedings of the 1st international workshop on Multimodal crowd sensing. ACM, 17–18.
[Ritter et al. 2012] Alan Ritter, Oren Etzioni, Sam Clark, and others. 2012. Open domain event extraction from twitter. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1104–1112.
[Wang et al. 2012] Xiaofeng Wang, Matthew S Gerber, and Donald E Brown. 2012. Automatic crime prediction using events extracted from twitter posts. In Social Computing, Behavioral-Cultural Modeling and Prediction. Springer, 231–238.
[Becker et al. 2011] Hila Becker, Mor Naaman, and Luis Gravano. 2011. Beyond Trending Topics: Real-World Event Identification on Twitter. In ICWSM.
We modify SLDS to RSLDS by avoiding Markovian dependence among switches.
In reality, the switches are temporally related but we decouple them for simplicity because we know the time.
1733 – DeMoivre developed the normal curve mathematically as a binomial distribution approx.
1783 – Laplace used normal curves to describe distribution of errors
PDF: f(x) = 1 / (σ√(2π)) · e^(−(x − μ)² / (2σ²))
Markovian: Current observation depends on previous observation.
If we consider different samples as IID and summarize variations using a single Gaussian distribution, we may miss time-based behavioral changes.
Output follows the state “linearly or as Gaussian”, but the state change can be non-linear or reset periodically.
Should number of additional latent states (beyond the observables) in traffic case be the same as the other influencers?
Low volume : traffic speeds and link transit times can vary immensely …
High volume : traffic speeds and link transit times should saturate …
---------
Latent variables are abduced from sample observations and then used for predictions …
---------
HOV
Markovian: Current observation depends on previous observation. Do the states remember the previous time’s value?
Why time series trajectory?: If we consider different samples as IID and summarize variations using a single Gaussian distribution, we may miss time-based behavioral changes.
E.g., an increasing and a decreasing time series may give the semblance of constancy
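A tiny numeric illustration of this point, using synthetic speeds: an increasing and a decreasing series can contain exactly the same values, so any order-free (IID) summary cannot tell them apart, and their point-wise average is flat.

```python
from statistics import mean, pstdev

rising = [40, 50, 60, 70, 80]    # speed increasing over the hour (km/h)
falling = [80, 70, 60, 50, 40]   # speed decreasing over the same hour

# Treated as IID samples, the two series are indistinguishable:
# same set of values, hence the same mean and the same spread.
same_mean = mean(rising) == mean(falling)
same_spread = pstdev(rising) == pstdev(falling)

# Averaging the two trajectories point by point gives a constant series.
pointwise_avg = [(r + f) / 2 for r, f in zip(rising, falling)]
```

Only a model that respects temporal order, such as an autoregressive model or an LDS, can separate the two behaviors.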
Boxplot of log-likelihood scores
accumulated for each hour of day
511.org data
Event data
Apache SPARK Reimplementation