Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Social Applications

Semantics-empowered Big Data Processing for PCS Applications
Krishnaprasad Thirunarayan (T. K. Prasad) and Amit Sheth
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, OH-45435

Outline
• 5 V’s of Big Data Research
• Semantic Perception for Scalability
• Lightweight semantics to manage heterogeneity
– Cost-benefit trade-off and continuum
• Hybrid Knowledge Representation and Reasoning
– Anomaly, Correlation, Causation
11/15/2013 Prasad

5V’s of Big Data Research
Volume
Velocity
Variety
Veracity
Value
Big Data => Smart Data

Volume : Assorted Examples
Check engine light analogy

Volume : Semantic Perception

Weather Use Case

Parkinson’s Disease Use Case

Heart Failure Use Case

Asthma Use Case

Traffic Use Case

[Figure: traffic monitoring with data from 511.org. Labels: slow-moving traffic, link description, scheduled events, schedule information.]
Heterogeneity in a Physical-Cyber-Social System

Volume with a Twist
Resource-constrained reasoning on mobile devices

Perception Cycle* that exploits background knowledge / domain models
[Figure: perception cycle over Prior Knowledge, linking Observe Property and Perceive Feature.]
1. Explanation: Abstracting raw data for human comprehension
2. Discrimination: Focus generation for disambiguation and action (incl. human in the loop)
* based on Neisser’s cognitive model of perception

Virtues of Our Approach to Semantic Perception
Blends simplicity, effectiveness, and scalability.
• Declarative specification of explanation and discrimination;
• With applications (e.g., to healthcare) that are of
contemporary relevance and interdisciplinary;
• Using encodings/algorithms that are significant (asymptotic
order of magnitude gain) and necessary (“tractable” due to
time/memory reduction for typical problem sizes); and
• Prototyped using extant PCs and mobile devices.
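
The explanation and discrimination steps can be pictured with plain set operations. This is a minimal illustration only, not the encoding evaluated in these slides; the toy knowledge base, the feature and property names, and the functions `explain` and `discriminate` are all hypothetical.

```python
# Hypothetical background knowledge: feature -> properties it can account for.
KB = {
    "flu": {"fever", "cough", "fatigue"},
    "cold": {"cough", "sneezing"},
    "allergy": {"sneezing", "itchy_eyes"},
}


def explain(observed, kb=KB):
    """Explanation: features that account for every observed property."""
    return {feature for feature, props in kb.items() if observed <= props}


def discriminate(candidates, observed, kb=KB):
    """Discrimination: unobserved properties that would split the candidates.

    A property is discriminating if some, but not all, candidates have it.
    """
    expected = set().union(*(kb[f] for f in candidates))
    return {
        p
        for p in expected - observed
        if 0 < sum(p in kb[f] for f in candidates) < len(candidates)
    }


# Observing a cough leaves two candidate features; discrimination then
# suggests which properties to observe next to tell them apart.
candidates = explain({"cough"})
next_props = discriminate(candidates, {"cough"})
```

With a single remaining candidate, `discriminate` returns an empty set, which corresponds to the cycle having nothing left to disambiguate.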

Evaluation on a mobile device
Efficiency Improvement: O(n^3) < x < O(n^4) → O(n)
• Problem size increased from 10’s to 1000’s of nodes
• Time reduced from minutes to milliseconds
• Complexity growth reduced from polynomial to linear

Volume and Velocity
• Lightweight semantics-based Adaptive/Continuous
Filtering
Disaster response use-case
• Building domain models dynamically
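
One way to picture adaptive, continuous filtering is a keyword filter whose vocabulary grows from the messages it accepts. The sketch below is illustrative only: the class `AdaptiveFilter`, its `min_count` threshold, the stopword list, and the sample messages are assumptions, not the system described in these slides.

```python
from collections import Counter

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "in", "on", "is", "at", "of", "and", "to"}


class AdaptiveFilter:
    """Keeps messages that mention a tracked term and promotes
    words that frequently co-occur with tracked terms."""

    def __init__(self, seed_terms, min_count=2):
        self.terms = set(seed_terms)
        self.min_count = min_count
        self.cooccur = Counter()

    def accept(self, message):
        words = {w.strip(".,!?#").lower() for w in message.split()}
        words -= STOPWORDS
        words.discard("")
        if not (words & self.terms):
            return False
        # Count untracked words seen alongside tracked terms; promote
        # a word once it has co-occurred min_count times.
        for w in words - self.terms:
            self.cooccur[w] += 1
            if self.cooccur[w] >= self.min_count:
                self.terms.add(w)
        return True


f = AdaptiveFilter({"earthquake"})
stream = [
    "Earthquake damage reported downtown, shelter needed",
    "Shelter open after the earthquake",
    "Shelter at the high school has space",
]
kept = [m for m in stream if f.accept(m)]
```

In this toy stream the third message never mentions the seed term; it is kept only because "shelter" was promoted after co-occurring with "earthquake" twice.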

Dynamic Model Creation
Continuous Semantics

Variety
Syntactic and semantic heterogeneity
• in textual and sensor data,
• in (legacy) materials data
• in (long tail) geosciences data

Variety (What?): Materials/Geosciences Use Case
• Structured Data (e.g., relational)
• Semi-structured, Heterogeneous Documents
(e.g., Publications and technical specs, which
usually include text, numerics, maps and images)
• Tabular data (e.g., ad hoc spreadsheets and
complex tables incorporating “irregular” entries)

Variety (How?/Why?): Granularity of Semantics & Applications
• Lightweight semantics: File and document-level
annotation to enable discovery and sharing
• Richer semantics: Data-level annotation and
extraction for semantic search and summarization
• Fine-grained semantics: Data
integration, interoperability and reasoning in
Linked Open Data
Cost-benefit trade-off and continuum

Challenges Associated with Typical Spreadsheet/Table
• Meant for human consumption
• Irregular
– Not a simple rectangular grid
• Heterogeneous
– Not all rows are interpreted similarly
• Complex
– Meaning of each row and each column is context-dependent
• Footnotes modify the meaning of entries (esp. in materials
and process specifications)

Practical Semi-Automatic Content Extraction
• DESIGN: Develop regular data structures that
can be used to formalize tabular information.
– Provide a natural expression of data
– Provide semantics to data, thereby removing potential
ambiguities
– Enable automatic translation
• USE: Manual population of regular tables and
automatic translation into LOD
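
As a rough illustration of the USE step, the sketch below turns rows of a regular table into N-Triples lines. The base IRI `http://example.org/`, the column names, the sample row, and the function `row_to_ntriples` are hypothetical, and every value is emitted as a plain string literal for simplicity.

```python
BASE = "http://example.org/"  # hypothetical namespace, not from the slides


def row_to_ntriples(row, key="id"):
    """Translate one table row (a dict) into N-Triples lines.

    The value in the `key` column names the subject; every other column
    becomes a predicate with a plain string literal as its object.
    """
    subject = f"<{BASE}material/{row[key]}>"
    lines = []
    for column, value in row.items():
        if column == key:
            continue
        # Escape backslashes and double quotes inside the literal.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} <{BASE}prop/{column}> "{literal}" .')
    return lines


# Made-up sample row standing in for a manually populated regular table.
table = [{"id": "alloy-1", "tensile_strength_MPa": 900, "form": "bar"}]
triples = [t for row in table for t in row_to_ntriples(row)]
```

Because the table is regular, the translation is a uniform loop over columns; the irregular source spreadsheet is where the manual effort goes.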

Variety (What?) : Sensor Data Use Case
Develop/learn domain models to exploit
complementary and corroborative
information
• To relate patterns in multimodal data to
“situation”
• To integrate machine sensed and human
sensed data

Variety: Hybrid KRR
Blending data-driven models with declarative
knowledge
– Data-driven: Bottom-up, correlation-based, statistical
– Declarative: Top-down, causal/taxonomical, logical
– Refine structure to better estimate parameters
E.g., Traffic Analytics using PGMs + KBs
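
A minimal sketch of the hybrid idea: declarative knowledge fixes which variables may influence which (the structure), and data supplies the numbers (the parameters). The edge from `scheduled_event` to `slow_traffic`, the toy observations, and the function `estimate_cpt` are hypothetical; a real PGM toolkit would replace the hand-rolled counting.

```python
from collections import Counter

# Declarative knowledge (hypothetical): the parents of each variable.
STRUCTURE = {"slow_traffic": ["scheduled_event"]}


def estimate_cpt(child, observations, structure=STRUCTURE):
    """Maximum-likelihood estimate of P(child=True | parents).

    Each observation is a dict of boolean values; the result maps a
    tuple of parent values to the estimated probability.
    """
    parents = structure[child]
    totals, positives = Counter(), Counter()
    for obs in observations:
        key = tuple(obs[p] for p in parents)
        totals[key] += 1
        positives[key] += int(obs[child])
    return {key: positives[key] / totals[key] for key in totals}


# Made-up observations for illustration.
observations = [
    {"scheduled_event": True, "slow_traffic": True},
    {"scheduled_event": True, "slow_traffic": True},
    {"scheduled_event": True, "slow_traffic": False},
    {"scheduled_event": False, "slow_traffic": False},
    {"scheduled_event": False, "slow_traffic": True},
    {"scheduled_event": False, "slow_traffic": False},
    {"scheduled_event": False, "slow_traffic": False},
]
cpt = estimate_cpt("slow_traffic", observations)
```

Restricting the parents to those the knowledge base allows keeps the table small, so each entry is estimated from more observations than an unconstrained structure would permit.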

Variety (Why?): Hybrid KRR
Data can help compensate for our overconfidence
in our own intuitions and reduce the extent to
which our desires distort our perceptions.
-- David Brooks of New York Times
However, inferred correlations require clear
justification that they are not coincidental, to
inspire confidence.

Correlations vs Causation vs Anomalies
• Correlations due to common cause or origin
• Coincidental due to data skew or misrepresentation
• Coincidental new discovery
• Strong correlation vs causation
• Anomalous and accidental
• Correlations turning into causation

• Correlations due to common cause or origin
– E.g., Planets: Copernicus > Kepler > Newton > Einstein
– E.g., Tall policy claims made by politicians!
– E.g., Hurricanes and Strawberry Pop-Tarts Sales
– E.g., Spicy foods vs Helicobacter pylori: Stomach Ulcers
– E.g., CO2 levels and Obesity
– E.g., Pavlovian learning: conditional reflex


Veracity
A lot of existing work on trust ontologies, metrics, and
models, and on provenance tracking
• Homogeneous data: Statistical techniques
• Heterogeneous data: Semantic models
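
For the statistical end of this spectrum, one widely used trust metric is the expected value of a Beta distribution over positive and negative past interactions. The sketch below is illustrative and is not claimed to be the metric used in the work summarized here; the function name `beta_trust` and the sample counts are hypothetical.

```python
def beta_trust(positive, negative):
    """Expected trust under a Beta(positive + 1, negative + 1) posterior.

    Starts at 0.5 with no evidence and moves toward the observed
    success rate as interactions accumulate.
    """
    if positive < 0 or negative < 0:
        raise ValueError("counts must be non-negative")
    return (positive + 1) / (positive + negative + 2)


no_history = beta_trust(0, 0)  # no evidence yet
reliable = beta_trust(8, 0)    # eight good interactions, none bad
mixed = beta_trust(3, 1)       # three good interactions, one bad
```

The added pseudo-counts keep a source with a single good report from being treated as fully trustworthy.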

Veracity
Machine sensing: objective, quantitative,
but prone to environmental effects, battery life, …
Human sensing: subjective, qualitative,
but prone to bias, perceptual errors, rumors, …
Open problem: Improving trustworthiness by
combining machine sensing and human sensing
– E.g., 2002 Überlingen mid-air collision: a pilot incorrectly
followed the traffic controller’s advice over the electronic TCAS
recommendation

(More on) Value
Learning domain models from “big data” for
prediction
E.g., Harnessing Twitter "Big Data" for Automatic
Emotion Identification

(More on) Value
Discovering gaps and enriching domain models
using data
E.g., Data-driven knowledge acquisition method for
domain knowledge enrichment in healthcare

Conclusions
• Glimpse of our research organized around
the 5 V’s of Big Data
• Discussed role in harnessing Value
– Semantic Perception (Volume)
– Continuum of Semantic models to manage
Heterogeneity (Variety)
– Hybrid KRR: Probabilistic + Logical (Variety)
– Continuous Semantics (Velocity)
– Trust Models (Veracity)

Thank you, and please visit us at
http://knoesis.org/
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA
Special Thanks to: Pramod Anantharam and Cory Henson
