Presentation at the AAAI 2013 Fall Symposium on Semantics for Big Data, Arlington, Virginia, November 15-17, 2013
Additional related material at: http://wiki.knoesis.org/index.php/Smart_Data
Related paper at: http://www.knoesis.org/library/resource.php?id=1903
Abstract: We discuss the nature of Big Data and address the role of semantics in analyzing and processing Big Data that arises in the context of Physical-Cyber-Social Systems. We organize our research around the five V's of Big Data, where four of the V's are harnessed to produce the fifth V: value. To handle the challenge of Volume, we advocate semantic perception that can convert low-level observational data to higher-level abstractions more suitable for decision-making. To handle the challenge of Variety, we resort to the use of semantic models and annotations of data so that much of the intelligent processing can be done at a level independent of heterogeneity of data formats and media. To handle the challenge of Velocity, we seek to use continuous semantics capability to dynamically create event- or situation-specific models and recognize new concepts, entities and facts. To handle Veracity, we explore the formalization of trust models and approaches to glean trustworthiness. The above four V's of Big Data are harnessed by semantics-empowered analytics to derive Value for supporting practical applications transcending the physical-cyber-social continuum.
Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Social Applications
1. Semantics-empowered Big Data Processing for PCS Applications
Krishnaprasad Thirunarayan (T. K. Prasad) and Amit Sheth
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, OH-45435
2. Outline
• 5 V’s of Big Data Research
• Semantic Perception for Scalability
• Lightweight semantics to manage heterogeneity
– Cost-benefit trade-off and continuum
• Hybrid Knowledge Representation and Reasoning
– Anomaly, Correlation, Causation
11/15/2013 Prasad
3. 5V’s of Big Data Research
Volume
Velocity
Variety
Veracity
Value
Big Data => Smart Data
12. Volume with a Twist
Resource-constrained reasoning on mobile devices
13. Perception Cycle* that exploits background knowledge / domain models
* based on Neisser’s cognitive model of perception
[Figure: perception cycle driven by Prior Knowledge, linking "Observe Property" and "Perceive Feature" through two phases]
1. Explanation: Abstracting raw data for human comprehension
2. Discrimination: Focus generation for disambiguation and action (incl. human in the loop)
14. Virtues of Our Approach to Semantic Perception
Blends simplicity, effectiveness, and scalability.
• Declarative specification of explanation and discrimination;
• With applications (e.g., to healthcare) that are of
contemporary relevance and interdisciplinary;
• Using encodings/algorithms that are significant (asymptotic
order of magnitude gain) and necessary (“tractable” due to
time/memory reduction for typical problem sizes); and
• Prototyped using extant PCs and mobile devices.
15. Efficiency Improvement
Complexity: O(n^3) < x < O(n^4) reduced to O(n)
• Problem size increased from 10’s to 1000’s of nodes
• Time reduced from minutes to milliseconds
• Complexity growth reduced from between cubic and quartic to linear
Evaluation on a mobile device
16. Volume and Velocity
• Lightweight semantics-based Adaptive/Continuous
Filtering
Disaster response use-case
• Building domain models dynamically
18. Variety
Syntactic and semantic heterogeneity
• in textual and sensor data,
• in (legacy) materials data
• in (long tail) geosciences data
19. Variety (What?): Materials/Geosciences Use Case
• Structured Data (e.g., relational)
• Semi-structured, Heterogeneous Documents
(e.g., Publications and technical specs, which
usually include text, numerics, maps and images)
• Tabular data (e.g., ad hoc spreadsheets and
complex tables incorporating “irregular” entries)
20. Variety (How?/Why?): Granularity of Semantics & Applications
• Lightweight semantics: File and document-level
annotation to enable discovery and sharing
• Richer semantics: Data-level annotation and
extraction for semantic search and summarization
• Fine-grained semantics: Data
integration, interoperability and reasoning in
Linked Open Data
Cost-benefit trade-off and continuum
21. Challenges Associated with Typical Spreadsheet/Table
• Meant for human consumption
• Irregular :
– Not simple rectangular grid
• Heterogeneous
– All rows not interpreted similarly
• Complex
– Meaning of each row and each column context
dependent
• Footnotes modify meaning of entries (esp. in materials
and process specifications)
23. Practical Semi-Automatic Content Extraction
• DESIGN: Develop regular data structures that
can be used to formalize tabular information.
– Provide a natural expression of data
– Provide semantics to data, thereby removing potential
ambiguities
– Enable automatic translation
• USE: Manual population of regular tables and
automatic translation into LOD
24. Variety (What?) : Sensor Data Use Case
Develop/learn domain models to exploit
complementary and corroborative
information
• To relate patterns in multimodal data to
“situation”
• To integrate machine sensed and human
sensed data
25. Variety: Hybrid KRR
Blending data-driven models with declarative
knowledge
– Data-driven: Bottom-up, correlation-based, statistical
– Declarative: Top-down, causal/taxonomical, logical
– Refine structure to better estimate parameters
E.g., Traffic Analytics using PGMs + KBs
26. Variety (Why?): Hybrid KRR
Data can help compensate for our overconfidence
in our own intuitions and reduce the extent to
which our desires distort our perceptions.
-- David Brooks, The New York Times
However, inferred correlations require clear
justification that they are not coincidental, to
inspire confidence.
27. Correlations vs Causation vs Anomalies
• Correlations due to common cause or origin
• Coincidental due to data skew or misrepresentation
• Coincidental new discovery
• Strong correlation vs causation
• Anomalous and accidental
• Correlation turning into causation
28. Correlations vs Causation vs Anomalies
• Correlations due to common cause or origin
– E.g., Planets: Copernicus > Kepler > Newton > Einstein
• Coincidental due to data skew or misrepresentation
– E.g., Tall policy claims made by politicians!
• Coincidental new discovery
– E.g., Hurricanes and Strawberry Pop-Tarts sales
• Strong correlation vs causation
– E.g., Spicy foods vs Helicobacter pylori: stomach ulcers
• Anomalous and accidental
– E.g., CO2 levels and obesity
• Correlation turning into causation
– E.g., Pavlovian learning: conditional reflex
30. Veracity
A lot of existing work on Trust ontologies, metrics and
models, and on Provenance tracking
• Homogeneous data: Statistical techniques
• Heterogeneous data: Semantic models
31. Veracity
Machine sensing: objective, quantitative,
but prone to environmental effects, battery life, …
Human sensing: subjective, qualitative,
but prone to bias, perceptual errors, rumors, …
Open problem: Improving trustworthiness by
combining machine sensing and human sensing
– E.g., 2002 Überlingen mid-air collision: pilot incorrectly
followed the traffic controller's advice over the electronic TCAS
system recommendation
32. (More on) Value
Learning domain models from “big data” for
prediction
E.g., Harnessing Twitter "Big Data" for Automatic
Emotion Identification
33. (More on) Value
Discovering gaps and enriching domain models
using data
E.g., Data driven knowledge acquisition method for
domain knowledge enrichment in the healthcare
34. Conclusions
• Glimpse of our research organized around
the 5 V’s of Big Data
• Discussed role in harnessing Value
– Semantic Perception (Volume)
– Continuum of Semantic models to manage
Heterogeneity (Variety)
– Hybrid KRR: Probabilistic + Logical (Variety)
– Continuous Semantics (Velocity)
– Trust Models (Veracity)
35. Thank you, and please visit us at
http://knoesis.org/
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA
Special Thanks to: Pramod Anantharam and Cory Henson
Editor's notes
Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Social Applications
Big Data Research: Sensor, Social, and Cyber-Physical Systems
Relevance of our research to Big Data and PCS applications by organizing it around the 5 V’s
Role in overcoming … challenge: Volume, Variety, both by combining probabilistic and logical knowledge
Size, rate of flow/accumulation and change, (syntactic and semantic) heterogeneity, trustworthiness, end-use (develop techniques to harness data to derive value in the presence of these challenges)
Huge amount of raw data generated by continuous monitoring => actionable nuggets for decision making
Michael J. Fox Foundation Parkinson's disease challenge: diagnosis and progression
-----
Embarrassingly parallel computations (Map-Reduce programming model) can be implemented on distributed fault-tolerant architectures/systems (HDFS + Hadoop) using redundant storage and computations: an answer for homogeneous data
-----
Semantics-based approaches needed to deal with variety or to transcend abstraction levels
-----
Check engine light signals/alerts on detecting an anomaly/problem => further analysis/action
What does semantic perception entail?
Making sense of large amounts of low-level data and communicating it in a meaningful way, e.g., ranges, aggregate/statistical measures
GOAL: “Buzz it up!”
-----
Semantic Perception: Converting Sensory Observations to Abstractions
Using perception cycle and domain models: derive explanation, determine focus to disambiguate and discriminate for taking actions
Hybrid reasoning: interleaved abductive and deductive components
[**complex domain models reflecting comorbidities: high-fidelity models**] [**Gleaning patterns from data**] [**Personalization**]
ParkinsonMild(person) = Tremor(person) ∧ PoorBalance(person)
ParkinsonModerate(person) = MoveSlow(person) ∧ PoorSleep(person) ∧ MonotoneSpeech(person)
ParkinsonAdvanced(person) = Fall(person)
-----
Loss of speech / food intake impossible / lack of balance => is there value in continuous monitoring? => Signatures for proactive control?
-----
Dataset characteristics: 8 weeks of data from 5 sensors on a smartphone, collected for 16 patients, resulting in ~12 GB (with a lot of missing data).
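The three Parkinson rules in the note above can be read as set-containment checks over observed symptoms. The following is a minimal sketch under that reading, not the authors' implementation; the function name `classify` and the use of plain strings for symptoms are assumptions made for illustration.

```python
# Each stage is a conjunction of symptoms, taken from the rules in the note.
RULES = {
    "ParkinsonMild": {"Tremor", "PoorBalance"},
    "ParkinsonModerate": {"MoveSlow", "PoorSleep", "MonotoneSpeech"},
    "ParkinsonAdvanced": {"Fall"},
}


def classify(observed):
    """Return every stage whose required symptoms are all observed.

    `classify` is a hypothetical helper: a stage fires when its
    symptom set is a subset of the observed symptoms.
    """
    observed = set(observed)
    return [stage for stage, required in RULES.items() if required <= observed]


if __name__ == "__main__":
    print(classify({"Tremor", "PoorBalance"}))
    print(classify({"Fall", "Tremor"}))
```

Keeping the rules in a data structure rather than in control flow is one way to honor the "declarative specification" the slides emphasize: adding a stage means adding an entry, not editing the matcher.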
Cardiologist evaluates the risk based on periodic monitoring data (+ human-sensed health info inputs)
-----
Reduce preventable readmissions: 25% of patients readmitted 30 days after discharge; 50% of patients readmitted after 60mo
EVIDENCE-BASED approach to diagnosis, treatment and control
Environmental: CO, CO2, NO, pollen counts, mold, dust, smoke, etc.
Physiological: Wheezometer (breathing), heart rate, etc.
25 million people in the U.S. are diagnosed with asthma (7 million are children) [1]
300 million people suffering from asthma worldwide [2]
Asthma-related healthcare costs alone are around $50 billion a year [2]
155,000 hospital admissions and 593,000 emergency department visits in 2006 [3]
Current predictions and long-term planning
Point of this slide: correlations
An Efficient Bit Vector Approach to Semantics-Based Machine Perception in Resource-Constrained Devices.
Resources: memory, CPU, power, …
Healthcare use case: privacy, mobility, cheap onboard sensors, personalization, power, and convenience considerations dominate
Abstracting and summarizing multimodal machine-sensed observations + human observations for actionable and human-accessible situational awareness and decision making
-----
Characteristics of a big data problem
Perception cycle contains interleaved iterative execution of two primary phases:
Explanation (abductive)
– translating low-level signals into high-level abstractions
– inference to the best explanation
Discrimination (deductive)
– focusing attention on those properties that will help distinguish between multiple possible explanations
– used to intelligently task sensors and collect additional observations (rather than brute-force approach of blindly collecting all observations)
-----
Ask human relevant questions
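The two phases described in the note above can be sketched with integer bit vectors, in the spirit of the bit-vector approach the notes mention. This is a simplified illustration, not the published algorithm: the property names, the tiny knowledge base `KB`, and the single-feature coverage assumption (one feature must account for all observations) are all hypothetical.

```python
# Hypothetical observable properties, one bit each.
PROPERTIES = ["elevated_bp", "clammy_skin", "palpitations"]
BIT = {name: 1 << i for i, name in enumerate(PROPERTIES)}

# Hypothetical knowledge base: feature -> bitmask of properties it can explain.
KB = {
    "hypertension": BIT["elevated_bp"],
    "hyperthyroidism": BIT["elevated_bp"] | BIT["palpitations"],
    "pulmonary_edema": BIT["elevated_bp"] | BIT["clammy_skin"] | BIT["palpitations"],
}


def explain(observed_mask):
    """Explanation: keep features whose mask covers every observed property."""
    return [f for f, mask in KB.items() if observed_mask & ~mask == 0]


def discriminate(observed_mask, candidates):
    """Discrimination: unobserved properties expected by some, but not all,
    candidate features. Observing one of them can narrow the candidates."""
    if not candidates:
        return []
    union, common = 0, KB[candidates[0]]
    for feature in candidates:
        union |= KB[feature]
        common &= KB[feature]
    useful = (union & ~common) & ~observed_mask
    return [p for p in PROPERTIES if useful & BIT[p]]


if __name__ == "__main__":
    seen = BIT["elevated_bp"]
    candidates = explain(seen)
    print(candidates, discriminate(seen, candidates))
```

Each coverage test is a single AND over machine words, which is the kind of constant-factor and memory saving that makes such reasoning plausible on a phone; the focus list is what would be used to task a sensor or ask the human a relevant question.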
Solving information overload problem: improving relevance (both recall and precision)
E.g., in the context of important/unfolding events, disaster scenarios, … learn to rank and select relevant hashtags for improved crawling and filtering
-----
Use keywords to carve out a relevant model of the domain for scalable and more focused information crawling, disambiguation and extraction in the face of rapidly unfolding events
-----
Leveraging Semantics for Detection of Event-Descriptors on Twitter
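One simple way to picture "rank and select relevant hashtags" is to count how often each hashtag co-occurs with seed keywords. The sketch below is an assumption-laden illustration, not the method of the cited work; `rank_hashtags`, whitespace tokenization, and the sample tweets are all made up for the example.

```python
from collections import Counter


def rank_hashtags(tweets, seed_keywords, top_k=3):
    """Rank hashtags by the number of seed-matching tweets they appear in.

    Hypothetical helper: a tweet "matches" if any whitespace-separated
    token equals a seed keyword (case-insensitive).
    """
    seeds = {k.lower() for k in seed_keywords}
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        if seeds.intersection(tokens):
            tags = {t for t in tokens if t.startswith("#") and len(t) > 1}
            counts.update(sorted(tags))  # count each tag once per tweet
    return [tag for tag, _ in counts.most_common(top_k)]


if __name__ == "__main__":
    sample = [
        "flood waters rising #houston #flood",
        "rescue boats needed #flood",
        "great concert tonight #music",
        "flood shelter open #shelter",
    ]
    print(rank_hashtags(sample, ["flood", "rescue"], top_k=1))
```

The selected hashtags could then be fed back as crawl filters, and the ranking recomputed as the event unfolds, which is the adaptive/continuous part of the filtering idea.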
Use seed keywords and tweets to carve out a relevant model from Wikipedia pages: Doozer
Track dynamically unfolding events
Syntactic: different data formats
Semantic: conceptual models
Semantic: multimodal sensing + different conceptual models
-----
Complementary and corroborative information => complete and reliable/robust
-----
“Semantics Empowered Web 3.0” book
Variety challenge: Sources of heterogeneity (Addl: UOM, table captions)
Use text-based metadata to help mediate
Semantics at different levels of detail and developed in stages:
-----
Ease of use by domain experts
Faster and wider adoption, promoting evolution
Low upfront cost to support
Shallow semantics has wider applicability to a range of documents/data and appeal to a broader community
Bottom line: “Learn to Walk before we Run”
-----
Controlled vocabularies <= Lightweight ontologies [legacy vocab + community agreed semantic relationships] <= Formal ontologies
Original document vs its translation => traceability (provenance)
-----
Past Research: We have dealt with top-down UMLS ontology vs bottom-up facts from PubMed in HPCO (Literature-based discovery -> LBD)
-----
RECALL: materials and process specs typically describe: composition, processing, testing, and packaging of material
Formalizing a procedure (a process or a test) as an aggregation of characteristic/parameter-value pairs = LOD
Eventually allows combining and comparing specs
-----
Biomaterials use case: Gold surface affinity of peptide sequence
Use case: Materials and process specs
Compact structures for sharing information: minimize duplication
AMS 4928N
http://www.youtube.com/watch?v=D8U4G5kcpcM
http://www.ndt-ed.org/EducationResources/CommunityCollege/Materials/Mechanical/Mechanical.htm
Most structural materials are anisotropic, which means that their material properties vary with orientation.
In products such as sheet and plate, the rolling direction is called the longitudinal direction, the width of the product is called the (long) transverse direction, and the thickness is called the short transverse direction.
In content extraction from tables, a human extractor formalizes the data using “predefined” tables, and a wizard then generates LOD from it.
Human extractor is responsible for gleaning the semantics (manual part)
Wizard is responsible for the mechanical translation (automatic part)
-----
The yardstick of success is the extent to which regular parts of the table can be automatically assimilated and translated, while leaving more complex parts for manual guidance.
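The "automatic part" described above, translating a manually populated regular table into Linked Open Data, might look like the following sketch. It is illustrative only: the namespace `http://example.org/spec/`, the `partOf` predicate, the function name, and the sample row are hypothetical, and column names are assumed to be simple identifiers that are safe to use inside an IRI.

```python
BASE = "http://example.org/spec/"  # hypothetical namespace, not from the source


def rows_to_ntriples(spec_id, rows):
    """Translate regular table rows (dicts) into N-Triples lines.

    Each row becomes one resource; each column becomes a predicate with a
    plain string literal as its object.
    """
    lines = []
    for index, row in enumerate(rows):
        subject = f"<{BASE}{spec_id}/entry{index}>"
        lines.append(f"{subject} <{BASE}partOf> <{BASE}{spec_id}> .")
        for column, value in row.items():
            # Escape backslashes and double quotes for an N-Triples literal.
            literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{BASE}{column}> "{literal}" .')
    return lines


if __name__ == "__main__":
    # Made-up row purely for illustration; not taken from any real spec.
    table = [{"parameter": "tensile_strength", "value": 100, "unit": "ksi"}]
    print("\n".join(rows_to_ntriples("demo_spec", table)))
```

Because each row is a flat set of parameter-value pairs, the translation is purely mechanical; the semantic judgment (which columns exist, how footnotes alter an entry) stays with the human who fills in the regular table.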
Event, disease, human-comprehensible features …
-----
Slow traffic vs reason for it (accident vs tree fall): semantics to data: sensors monitoring traffic space
-----
Cardiology use case: how a patient is feeling (giddy, depressed, etc.)
Idea: Glean statistical correlations from data (PGM) and enrich/validate them using symbolic knowledge (manually curated): orient undirected links, delete conflicting links, and complement nodes and links
Explicit declarative knowledge obviates the need to learn it from data, especially in the context of sparse/skewed data, PLUS it will be reliable
-----
Structure learning uncovers qualitative conditional dependencies; integrate with declarative information using progressively expressive graphical models at the same abstraction level
Parameter learning uses the refined structure to estimate a better-fitting model
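The structure-refinement step in the note above (orient undirected links, delete conflicting links, complement missing links) can be sketched with plain sets. This is a toy illustration, not the authors' pipeline: `refine_structure`, the two knowledge-base arguments, and the traffic variables are assumptions, and no actual structure or parameter learning is performed here.

```python
def refine_structure(learned_links, causal_kb, independent_kb):
    """Combine learned undirected links with declarative knowledge.

    learned_links: iterable of (a, b) pairs found by correlation analysis.
    causal_kb: set of (cause, effect) pairs asserted by domain knowledge.
    independent_kb: set of frozenset({a, b}) pairs asserted to be unrelated.
    Returns (directed links, still-undirected links).
    """
    directed = set(causal_kb)  # complement: declared links are always kept
    unresolved = set()
    for a, b in learned_links:
        pair = frozenset((a, b))
        if pair in independent_kb:
            continue  # delete links that conflict with declared knowledge
        if (a, b) in causal_kb or (b, a) in causal_kb:
            continue  # already oriented by the knowledge base
        unresolved.add(pair)  # orientation left to later analysis
    return directed, unresolved


if __name__ == "__main__":
    learned = [
        ("slow_traffic", "accident"),
        ("slow_traffic", "ice_cream_sales"),
        ("rain", "slow_traffic"),
    ]
    causal = {("accident", "slow_traffic"), ("scheduled_event", "slow_traffic")}
    independent = {frozenset(("slow_traffic", "ice_cream_sales"))}
    print(refine_structure(learned, causal, independent))
```

The refined link set would then be the input to parameter learning, so that probabilities are estimated over a structure already vetted against curated knowledge.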
Discovering “unexpected” correlations, and then seeking a transparent basis for them, seems worthy of pursuit. For instance, consider the controversies surrounding assertions such as ‘smoking causes cancer’, ‘high debt causes low growth’, ‘low growth causes high debt’, and ‘religious fanaticism breeds terrorists’.
(1) E.g., tides and ebbs caused by the alignment of earth, sun and moon around full moon and new moon; “anomalous” orbits of Solar System planets w.r.t. the “circular” motion of stars in geocentric theory (‘planet’ is ‘wanderer’ in Greek), explained by heliocentrism (Copernicus) and the theory of gravitation; correlation of time period and distance of planets (Kepler); and the “anomalous” precession of Mercury’s orbit, clarified by the General Theory of Relativity (Einstein). C-peptide protein can be used to estimate insulin produced by a patient’s pancreas.
=> ANOMALY (Copernicus) and REGULARITY (Kepler) => CAUSE (Newton) => (Newtonian Mechanics) => (General Theory of Relativity)
(2) Bold claims all the time in politics
(3) Beer vs diaper; Walmart’s hurricanes vs Pop-Tarts
-----
(4) Stress/spicy foods are correlated with peptic ulcers, but the latter are caused by Helicobacter pylori, as demonstrated by the Nobel Prize-winning work of Marshall and Warren. ORIENTATION UNCLEAR: ‘high debt causes low growth’, ‘low growth causes high debt’
-----
(5) Since the 1950s, both the atmospheric carbon dioxide level and obesity levels have increased sharply.
(6) Pavlovian learning induced conditional reflex, and some of the financial market moves, seem to be classic cases of correlation turning into causation!
-----
PARADOXES: THE SEEDS OF PROGRESS
Zeno’s paradox, Hydrostatic paradox, light speed constant in all reference frames, CBR, Expanding universe, …
Different forms of trust; what features contribute to trust; how do we combine trust
Trust propagation: aggregation and chaining
Application-specific basis / Axiomatic
E-commerce examples: risk tolerance (propensity to trust) + trustworthiness = trust
complementary and corroboratory
Biggest hurdle in ML: significant training dataset
Training big data: tweets with emotion hashtags (provided by the tweet creator)
Learn domain model to associate emotion hashtags with tweet content
Glean/predict emotions from “untagged” tweets using this model
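The idea in the note above, using emotion hashtags supplied by tweet authors as free training labels, can be illustrated with a deliberately tiny word-count model. This is a sketch under stated assumptions, not the classifier used in the cited work: the hashtag-to-emotion map, whitespace tokenization, the count-based scoring, and the sample tweets are all invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from emotion hashtags to labels.
EMOTION_TAGS = {"#joy": "joy", "#sad": "sadness", "#angry": "anger"}


def label_from_hashtags(tweet):
    """Return (words, emotion) if the tweet carries exactly one emotion tag."""
    tokens = tweet.lower().split()
    labels = {EMOTION_TAGS[t] for t in tokens if t in EMOTION_TAGS}
    if len(labels) != 1:
        return None  # untagged or ambiguous: not usable as training data
    words = [t for t in tokens if t not in EMOTION_TAGS]  # hide the label
    return words, labels.pop()


def train(tweets):
    """Count word occurrences per emotion over self-labeled tweets."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        labeled = label_from_hashtags(tweet)
        if labeled:
            words, emotion = labeled
            counts[emotion].update(words)
    return counts


def predict(counts, tweet):
    """Pick the emotion whose training words overlap the tweet the most."""
    words = tweet.lower().split()
    scores = {e: sum(c[w] for w in words) for e, c in counts.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None


if __name__ == "__main__":
    model = train([
        "What a wonderful sunny day #joy",
        "I miss my old friends #sad",
        "stuck in traffic again #angry",
    ])
    print(predict(model, "such a wonderful day"))
```

Removing the hashtag from the training text matters: otherwise the model would simply learn to read the label back, and would have nothing to go on for untagged tweets.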
EMR
Semantic Perception: Hybrid Abductive/Deductive Reasoning (Volume)
Cost-benefit trade-off and Continuum of Semantic models to manage Heterogeneity (Variety)
Hybrid Knowledge Representation and Reasoning: Probabilistic + Logical: structure + parameter estimation (Variety)