1. http://cascaderesearch.org
DATA-DRIVEN SENSEMAKING IN AN EVOLVING, NOISY WORLD
K. Selçuk Candan
Professor of Computer Science and Engineering
Director, Center for Assured and Scalable Data Engineering (CASCADE)
Supported by
• NSF; “Data Management for Real-Time Data Driven Epidemic Spread Simulations”
• NSF; “RAPID - Understanding the Evolution Patterns of the Ebola Outbreak in West-Africa and Supporting Real-Time Decision
Making and Hypothesis Testing through Large Scale Simulations”
• NSF; “E-SDMS: Energy Simulation Data Management System Software”
• JCI; “I2AV: Integrate, Index, Analyze, and Visualize Energy Data for Data-driven Simulations and Optimizations”
• NSF; “An Infrastructure to Support Complex Financial Patterns (CFP) based Real-Time Services Delivery and Visual Analytics”
• NSF I/UCRC planning grant (NSF-IIP1464579) for “Center for Assured and Scalable Data Engineering”
2. http://cascaderesearch.org
“Sense”making…what does it mean?
• Etymology:
• 1st sense: from Latin “sentire” or “to perceive”
• any of the faculties, as sight, hearing, smell, taste, or touch, by which
humans and animals perceive stimuli originating from outside or inside the
body
• 2nd sense: “to attain awareness or understanding of…”
• “awareness” implies vigilance in observing or alertness in drawing
inferences from what one experiences
• “understanding” is the power to make experience intelligible by applying
concepts and categories
3. http://cascaderesearch.org
…did you notice something?
• …there is a gap between the first meaning (feeling, measurement) and
the second (awareness, understanding)
• …and that gap (or rather the data infrastructure needed to bridge it)
is what my research is about
[Figure labels: sensors, knowledge bases, sensing, sensemaking, application; awareness, understanding, control]
4. http://cascaderesearch.org
We are living in a dynamic world…
energy
business/enterprise
health-care
entertainment
education
rehabilitation
elderly-care
production
life-sciences
sports
security
defense
transportation
supply-chain
retail
arts
advertisement
child-care
pet-care
personal-data management
robotics
smart-rooms
smart-offices
training
space exploration
sciences
5. http://cascaderesearch.org
Epidemics….
• The SARS (Severe Acute Respiratory Syndrome) epidemic is estimated to have started in
China in November 2002 and had spread to 29 countries by August 2003
• A pandemic similar to the 2009 swine flu is estimated to cost the global economy
$360 billion in a mild scenario and up to $4 trillion in an ultra scenario, within the first
year of the outbreak
• The World Health Organization declared the Ebola epidemic in West Africa a Public
Health Emergency of International Concern on August 8th, 2014, with exponential
dynamics characterizing the initial growth in numbers of new cases in some areas
K. Selcuk Candan @ ASU
• NSF III#1318788 “Data Management for Real-Time Data Driven Epidemic Spread Simulations”
6. http://cascaderesearch.org
Epidemics….
Not much room for error
Both action and inaction can have high costs in terms
of their economic impacts and human lives affected
7. http://cascaderesearch.org
Bad news…
• Challenge #1: Epidemic data involves
• 100s of inter-dependent parameters,
• spanning multiple layers and geo-spatial frames,
• affected by complex dynamic processes operating at different resolutions.
• Challenge #2: Given the
• unpredictability of an epidemic and
• unpredictability of the actions of various independent agencies,
decision makers need to generate many thousands of simulations,
each with different parameters corresponding to plausible scenarios.
• Challenge #3: Models and simulations need to be continuously revised
based on real-world data as the epidemic and intervention mechanisms
evolve.
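Challenge #2 amounts to a parameter sweep over an epidemic model. The sketch below is only an illustration under stated assumptions: it uses a deterministic discrete-time SIR model (the slides do not name a specific model), and `simulate_sir`, `run_ensemble`, the population size, and the parameter grids are hypothetical names and values chosen for the example.

```python
from itertools import product


def simulate_sir(beta, gamma, population=10_000, infected0=10, days=120):
    """Discrete-time SIR model; returns the number of infected people per day."""
    s, i, r = population - infected0, infected0, 0.0
    infected = [i]
    for _ in range(days):
        new_infections = min(s, beta * s * i / population)
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        infected.append(i)
    return infected


def run_ensemble(betas, gammas):
    """Run one simulation per (beta, gamma) scenario and record its peak load."""
    results = {}
    for beta, gamma in product(betas, gammas):
        results[(beta, gamma)] = max(simulate_sir(beta, gamma))
    return results


if __name__ == "__main__":
    # Hypothetical scenario grid: transmission rates x recovery rates.
    peaks = run_ensemble(betas=[0.2, 0.3, 0.4], gammas=[0.1, 0.2])
    for (beta, gamma), peak in sorted(peaks.items()):
        print(f"beta={beta:.1f} gamma={gamma:.1f} peak infected={peak:,.0f}")
```

A real ensemble would involve hundreds of parameters and many thousands of runs, which is where storing, indexing, and comparing the simulation results becomes a data management problem.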
8. http://cascaderesearch.org
Building energy sector…
• The building sector was responsible for nearly half of CO2 emissions in the US in 2009.
• According to the US Energy Information Administration, buildings consume more
energy than any other sector, accounting for 48.7% of overall energy consumption, and
building energy consumption is projected to grow faster than that of the industry and
transportation sectors.
• NSF SI^2#1339835 “E-SDMS: Energy Simulation Data Management System Software”
• JCI Grant “I2AV: Integrate, Index, Analyze, and Visualize Energy Data for Data-driven Simulations and Optimizations ”
U.S. Energy Information Administration. 2008. International Energy Statistics.
9. http://cascaderesearch.org
Good news….
• By 2030, 82% of the US building stock is expected to be relying on smart and cleaner
energy technologies
• Building energy management systems (BEMSs) process large volumes of data, including
• continuously collected heating, ventilation, and air conditioning (HVAC) sensor and actuation data of
residential and commercial buildings of all types and sizes
• other sensory data, such as occupancy, humidity, lighting levels, air speed and quality,
• architectural, mechanical, and building automation system configuration data,
• local weather and GIS data that provide contextual information, as well as
• price, consumption, and cost data from electricity (such as smart grid) and gas utilities
http://econtrol.me/Smart%20Building.html
http://customloungeuk.com
Because of
• the size and complexity of the data and
• the varying spatial and temporal scales at which the key
processes operate,
experts lack the means to understand and predict the relevant
processes.
10. http://cascaderesearch.org
Sensemaking in a dynamic world…
[Figure: the application domains from slide 4 surrounding a four-stage cycle: Sense & Integrate, Simulate & Predict, Validate & Interpret, Act & Adapt]
(a) Sense & Integrate:
take as inputs, and integrate, data and models of
the application space together with continuously sensed
real-time observational data,
(b) Simulate & Predict:
support data-driven simulation and predictive
analysis over the integrated data sets and models,
(c) Validate & Interpret:
enable validation of observations, models, and
simulation/prediction results, and represent data and
results intuitively, to support trustworthy and
accurate decision making, and
(d) Act & Adapt:
provide continuous adaptation of models and
predictions based on the validated predictions and
observations.
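The four stages can be read as a feedback loop over a stream of observations. The following is a toy sketch, not the CASCADE architecture: the "model" is a single multiplicative growth rate, and `sensemaking_loop`, `tolerance`, and `step` are hypothetical names introduced only to show where each stage sits in the loop.

```python
def sensemaking_loop(observations, rate=1.0, tolerance=0.05, step=0.5):
    """Run the four-stage cycle over (previous, observed) reading pairs.

    Returns one (predicted, trusted, rate) record per observation.
    """
    history = []
    for previous, observed in observations:          # (a) Sense & Integrate
        predicted = rate * previous                  # (b) Simulate & Predict
        error = (observed - predicted) / observed    # (c) Validate & Interpret
        trusted = abs(error) <= tolerance
        if not trusted:                              # (d) Act & Adapt
            # Move the model parameter toward the observed growth rate.
            rate += step * (observed / previous - rate)
        history.append((predicted, trusted, rate))
    return history


if __name__ == "__main__":
    readings = [100.0, 120.0, 144.0, 172.8, 207.36]
    for record in sensemaking_loop(list(zip(readings, readings[1:]))):
        print(record)
```

The model is revised only when validation fails, so predictions that stay within tolerance leave the model untouched.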
11. http://cascaderesearch.org
Data challenges in a dynamic world
[Figure: the application domains from slide 4, annotated with three groups of data challenges]
3Vs: (V)olume, (V)elocity, (V)ariety
HMLE: (H)igh-dimensional, (M)ulti-modal, Inter-(L)inked, (E)volving
ISQ: (I)mprecision, (S)parsity, (Q)uality/Noise
13. http://cascaderesearch.org
ASU Center for Assured and Scalable Data Engineering
CASCADE-IUCRC
Industry/University Collaborative Research Center (I/UCRC)
* NSF I/UCRC planning grant (NSF-IIP1464579)
14. http://cascaderesearch.org
“Big Data” Industry Roundtable at ASU
• Co-organized with IBM
• On-site or off-site participation
• Aerojet
• Avnet
• Boeing
• Facebook
• Google
• IBM TJ Watson (Exascale System Software)
• IBM Smart Analytics
• IO Data Centers
• Johnson Controls
• LinkedIn
• Lockheed Martin
• Mayo Clinic
• NEC Labs
• Oracle
• Salt River Project
• SAP
16. http://cascaderesearch.org
Key knowledge gaps..
• Six most critical knowledge competency groups (in terms of the value
gap, i.e., the difference between the current and desired states of the
knowledge area):
1. temporal and spatial analyses,
2. summarization, cleaning, visualization, anomaly detection,
3. real-time processing for streaming data,
• media analytics
4. representations and fusion for unstructured/structured data, semantic Web,
• make unstructured data queryable, prioritize and rank data, correlate and identify the
gaps in the data
5. graph-based models, social networks,
• entity analytics, (social and other) network analytics
6. performance and scalability, distributed architectures.
“Hunting for the Value Gaps in Data Management, Services, and Analytics”,
ACM SIGMOD blog; http://wp.sigmod.org/
17. http://cascaderesearch.org
CASCADE Mission
• Mission: to support the innovation of data architectures
and tools that can match the scale of the data and support
timely and assured decision making to generate value.
[Figure: the cycle Sense & Integrate, Simulate & Predict, Validate & Interpret, Act & Adapt, with the labels Data Management, Data Analysis, and Data Assurance]
18. http://cascaderesearch.org
[Figure: layered diagram; the recoverable labels follow]
FUNDAMENTAL KNOWLEDGE:
• modeling, organization, storage/indexing, replication, fusion/integration, ingest, compression, visualization, partitioning
• hiding, security, encryption, repudiation, provenance, authentication, trust models, access control, fingerprinting, tamper detection
• summarization/aggregation, sampling, cleaning, normalization, annotation, dimensionality reduction, media analysis, machine learning
ENABLING TECHNOLOGIES:
• Technology Element: Real-time Data Processing and Analysis
• Technology Element: Parallel and Distributed Data Processing and Analysis
• Technology Element: High-dimensional and Multi-modal Data Processing and Analysis
• Technology Element: Trusted and Privacy-preserving Data Processing and Analysis
SYSTEMS
Other labels: Fundamental Insights, Requirements, System Requirements, Product and Outcomes, Partners & Stakeholders
TECHNOLOGY BARRIERS:
• availability,
• timeliness,
• cost,
• consistency,
• trust,
• privacy,
• security,
• compliance, and
• accessibility
FUNDAMENTAL BARRIERS:
• heterogeneous data and models,
• transient, mobile, and distributed data,
• multi-scale, multi-resolution data,
• data with different quality, precision,
privacy, security, and trust levels,
• varying data volume and characteristics, and
• high dimensional, complex data
20. http://cascaderesearch.org
CASCADE team
Name                  Title                 Area(s) of specialization as they relate to the proposed concentration
K. Selcuk Candan      Professor             Scalable data management and media analysis
Hasan Davulcu         Associate Professor   Databases and data extraction
Gail-Joon Ahn         Professor             Security and privacy in distributed data systems
Huan Liu              Professor             Data mining and analysis
Ross Maciejewski      Assistant Professor   Data visualization
Baoxin Li             Professor             Statistical machine learning, media analysis
Rao Kambhampati       Professor             Data integration, data cleaning
Chitta Baral          Professor             Knowledge representation, NLP
Dijiang Huang         Associate Professor   Data clouds
Hanghang Tong         Assistant Professor   Graph structured data
Mohamed Sarwat        Assistant Professor   Data management systems
Jingrui He            Assistant Professor   Data analysis and sparse learning
Paulo Shakarian       Assistant Professor   Data and network analysis
Rong Pan              Associate Professor   Data analytics
Jing Li               Associate Professor   Data analytics
Ron Askin             Professor             Data-driven decision models
Teresa Wu             Professor             Decision support, health informatics
Ming Zhao             Associate Professor   Scalable data processing
Adam Doupe            Assistant Professor   Data security
Paolo Papotti         Assistant Professor   Data integration and management
23. http://cascaderesearch.org
Common approaches to learning
• There are several technical approaches.
• factorization, matrix/tensor decomposition
• probabilistic (Bayesian/graphical model) learning
• deep structured learning and neural networks.
25. http://cascaderesearch.org
Tensor analysis…
• Tensor decomposition [CP,Tucker] can be used for
• understanding spectral characteristics of the data and
• clustering the data based on inter-dependencies.
CP decomposition: R clusters and cluster memberships
[Figure labels: three factor matrices and a core tensor]
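To make the CP decomposition concrete, here is a minimal NumPy sketch of rank-R CP for a 3-mode tensor using alternating least squares. It is a textbook-style illustration under stated assumptions, not the author's implementation; `khatri_rao`, `unfold`, `cp_als`, and `reconstruct` are hypothetical helper names, and the dense, in-memory setting ignores the scalability issues the talk is about.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product of a (I x R) and b (J x R) -> (I*J x R)."""
    return np.einsum("ir,jr->ijr", a, b).reshape(-1, a.shape[1])


def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front, flatten the rest (C order)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def reconstruct(factors):
    """Rebuild the 3-mode tensor as a sum of R rank-one components."""
    a, b, c = factors
    return np.einsum("ir,jr,kr->ijk", a, b, c)


def cp_als(tensor, rank, n_iter=200, tol=1e-10, seed=0):
    """Rank-`rank` CP decomposition of a 3-mode tensor via ALS."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in tensor.shape]
    previous_error = np.inf
    for _ in range(n_iter):
        for mode in range(3):
            # Fix the other two factor matrices and solve for this one.
            first, second = [factors[m] for m in range(3) if m != mode]
            gram = (first.T @ first) * (second.T @ second)
            factors[mode] = (
                unfold(tensor, mode) @ khatri_rao(first, second) @ np.linalg.pinv(gram)
            )
        error = np.linalg.norm(tensor - reconstruct(factors))
        if abs(previous_error - error) < tol:  # convergence condition
            break
        previous_error = error
    return factors


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    truth = [rng.standard_normal((dim, 2)) for dim in (4, 5, 6)]
    data = reconstruct(truth)
    estimate = cp_als(data, rank=2)
    relative = np.linalg.norm(data - reconstruct(estimate)) / np.linalg.norm(data)
    print("relative reconstruction error:", relative)
```

Each column r of the three factor matrices describes one of the R clusters, and the entries in that column act as membership weights along the corresponding mode.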
26. http://cascaderesearch.org
Tensor representation of data
• Most media and sensor data are
• multi-dimensional and
• multi-relational
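• Temporally evolving data can be represented as
• Alternative #1: an incrementally growing tensor, or
• Alternative #2: a sequence of tensor snapshots
E.g., the tuple (a, b, 2) of a relation with attributes A, B, C is represented as the tensor cell at coordinates (a, b, 2) with value 1
[Figure: the example relation and its tensor; the two temporal alternatives drawn along a time axis]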
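The mapping from relational tuples to a tensor, and the two temporal alternatives, can be sketched as follows. This is an illustrative assumption about the encoding (a binary occurrence tensor with one mode per attribute); `tuples_to_tensor`, the attribute domains, and the second tuple are hypothetical and not taken from the slides.

```python
import numpy as np


def tuples_to_tensor(tuples, domains):
    """Encode relational tuples as a binary occurrence tensor.

    `domains` lists, per attribute, the ordered values the attribute may take;
    a cell holds 1 if the corresponding tuple occurs and 0 otherwise.
    """
    index = [{value: pos for pos, value in enumerate(dom)} for dom in domains]
    tensor = np.zeros([len(dom) for dom in domains])
    for row in tuples:
        cell = tuple(index[mode][value] for mode, value in enumerate(row))
        tensor[cell] = 1
    return tensor


# Attributes A, B, C with small, made-up domains.
domains = [["a", "c"], ["b", "d"], [1, 2, 3]]
snapshot_t0 = tuples_to_tensor([("a", "b", 2)], domains)
snapshot_t1 = tuples_to_tensor([("a", "b", 2), ("c", "d", 3)], domains)

# Alternative #2: keep one tensor per timestamp as a sequence of snapshots.
snapshots = [snapshot_t0, snapshot_t1]

# Alternative #1: a single tensor that grows along an extra time mode.
growing = np.stack(snapshots, axis=-1)
print(growing.shape)
```

The snapshot sequence keeps each time step independently analyzable, while the growing tensor exposes time as a mode that a decomposition can factor alongside the data modes.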
28. http://cascaderesearch.org
Tensor analysis…
• Tensor decomposition [CP,Tucker] can be used for
• understanding spectral characteristics of the data and
• clustering the data based on inter-dependencies.
Tucker decomposition: r1 × r2 × r3 clusters and cluster memberships
[Figure labels: three factor matrices and a core tensor]
Problems:
• these are very computationally expensive operations,
• they are also memory intensive,
• they do not go hand-in-hand with other data
manipulation operations (selection, join, union)
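One common way to obtain a Tucker decomposition is the truncated higher-order SVD (HOSVD), sketched below in NumPy. The slides do not say which Tucker algorithm is meant, so this is an assumption made for illustration; `hosvd`, `mode_product`, and `tucker_reconstruct` are hypothetical helper names.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` (new_dim x old_dim) along `mode`."""
    moved = np.moveaxis(tensor, mode, -1)
    return np.moveaxis(moved @ matrix.T, -1, mode)


def hosvd(tensor, ranks):
    """Truncated HOSVD: one factor matrix per mode plus a core tensor."""
    factors = []
    for mode, rank in enumerate(ranks):
        # One SVD per mode, each over an unfolding of the whole tensor.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, factor in enumerate(factors):
        core = mode_product(core, factor.T, mode)
    return core, factors


def tucker_reconstruct(core, factors):
    """Expand the core back to the original space using the factor matrices."""
    tensor = core
    for mode, factor in enumerate(factors):
        tensor = mode_product(tensor, factor, mode)
    return tensor


if __name__ == "__main__":
    data = np.random.default_rng(0).standard_normal((6, 7, 8))
    core, factors = hosvd(data, ranks=(3, 3, 3))
    approx = tucker_reconstruct(core, factors)
    print(core.shape, np.linalg.norm(data - approx) / np.linalg.norm(data))
```

Even in this small sketch, every mode requires an SVD over a matrix with as many entries as the full tensor, which illustrates the compute and memory costs listed under "Problems".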
29. http://cascaderesearch.org
Common data characteristics…
• The key characteristics of real-world data sets
include the following:
• multi-variate,
• multi-modal,
• temporal,
• spatial,
• hierarchical,
• graphical,
• multi-layer,
• multi-resolution, and
• inter-dependent
• observations of interest depend on and impact each
other
35. http://cascaderesearch.org
Research challenges…
Questions:
• how to best account for the different modalities of the
data?
• can we leverage metadata to support multi-resolution
and incremental tensor analysis operations?
• can we implement memory-hierarchy-supported
tensor analysis?
• can we co-optimize tensor analysis and other data
manipulation operations?
36. http://cascaderesearch.org
What about other approaches?
• There are several technical approaches.
• factorization, matrix/tensor decomposition
• probabilistic (Bayesian/graphical model) learning
• deep structured learning and neural networks.
…many of the algorithms are based on iterative processes, such as
alternating least squares (ALS) or stochastic gradient descent (SGD), which
successively refine an approximate solution until a convergence condition is reached
Question: Can we develop metadata-supported, multi-scale
techniques that leverage the volume/cost trade-offs provided by
storage hierarchies to provide high accuracy at minimum cost?
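As an illustration of such an iterative process, the sketch below fits a low-rank matrix factorization with stochastic gradient descent and stops once the loss changes by less than a tolerance. It is a generic example under stated assumptions, not a method from the talk; `sgd_factorize`, the learning rate, the tolerance, and the toy data are all hypothetical choices.

```python
import numpy as np


def sgd_factorize(entries, shape, rank=2, lr=0.05, tol=1e-6, max_epochs=5000, seed=0):
    """Fit matrix ~ U @ V.T to observed (row, col, value) entries with SGD."""
    rng = np.random.default_rng(seed)
    u = 0.1 * rng.standard_normal((shape[0], rank))
    v = 0.1 * rng.standard_normal((shape[1], rank))
    previous_loss = np.inf
    for epoch in range(max_epochs):
        # One pass over the observed entries in random order.
        for idx in rng.permutation(len(entries)):
            i, j, value = entries[idx]
            residual = value - u[i] @ v[j]
            u_i = u[i].copy()  # keep the old row so both updates use the same point
            u[i] += lr * residual * v[j]
            v[j] += lr * residual * u_i
        loss = sum((value - u[i] @ v[j]) ** 2 for i, j, value in entries)
        if abs(previous_loss - loss) < tol:  # convergence condition
            break
        previous_loss = loss
    return u, v, loss, epoch + 1


if __name__ == "__main__":
    # Toy data: all entries of the rank-one matrix with value (i + 1) * (j + 1).
    observed = [(i, j, float((i + 1) * (j + 1))) for i in range(3) for j in range(2)]
    u, v, loss, epochs = sgd_factorize(observed, shape=(3, 2))
    print(f"stopped after {epochs} epochs with squared error {loss:.2e}")
```

The number of passes over the data depends on the tolerance, which is one place where an accuracy versus cost trade-off shows up.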
37. http://cascaderesearch.org
Conclusions…
Making sense of a dynamically evolving world is a really,
really challenging task…
39. http://cascaderesearch.org
Relevant Publications
• Xinsheng Li, Shengyu Huang, K. Selcuk Candan, Maria Luisa Sapino. 2PCP: Two-Phase CP
Decomposition for Billion-Scale Dense Tensors. IEEE Int. Conference on Data Engineering (ICDE)
2016.
• Jung Hyun Kim, K. Selcuk Candan, Maria Luisa Sapino, PageRank Revisited: On the Relationship
between Node Degrees and Node Significances in Different Applications, International Workshop on
Querying Graph Structured Data (GraphQ'16), in conjunction with EDBT 2016.
• Mijung Kim, K. Selcuk Candan: Decomposition-by-normalization (DBN): leveraging approximate
functional dependencies for efficient CP and Tucker decompositions. Data Min. Knowl. Discov. 30(1):
1-46 (2016)
• Shengyu Huang, Xinsheng Li, K. Selcuk Candan, Maria Luisa Sapino: Reducing seed noise in
personalized PageRank. Social Netw. Analys. Mining. 6(1): 6:1-6:25 (2016)
• Mithila Nagendra, K. Selcuk Candan: Efficient Processing of Skyline-Join Queries over Multiple
Data Sources. ACM Trans. Database Syst. 40(2): 10 (2015)
• Jung Hyun Kim, K. Selcuk Candan, Maria Luisa Sapino: Locality-sensitive and Re-use Promoting
Personalized PageRank Computations. Knowledge and Information Systems, pp 1-39, First online:
18 June 2015.
• Parth Nagarkar, K. Selcuk Candan, Aneesha Bhat: Compressed Spatial Hierarchical Bitmap (cSHB)
Indexes for Efficiently Processing Spatial Range Query Workloads. PVLDB 8(12): 1382-1393 (2015)
• Xilun Chen, K. Selcuk Candan, Maria Luisa Sapino, Paulo Shakarian: KSGM: Keynode-driven
Scalable Graph Matching. CIKM 2015: 1101-1110
40. http://cascaderesearch.org
Relevant Publications
• Xilun Chen and K. Selcuk Candan. LWI-SVD: Low-rank, Windowed, Incremental Singular Value
Decompositions on Time-Evolving Data Sets. KDD'14, NY, USA. 2014.
• Xilun Chen and K. Selcuk Candan. GI-NMF: Group Incremental Non-Negative Matrix Factorization
on Data Streams. ACM International Conference on Information and Knowledge
Management (CIKM'14). Shanghai, China. 2014.
• Mijung Kim and K. Selcuk Candan. Efficient Static and Dynamic In-Database Tensor
Decompositions on Chunk-Based Array Stores. ACM International Conference on
Information and Knowledge Management (CIKM'14). Shanghai, China. 2014.
• Xinsheng Li, Shengyu Huang, K. Selcuk Candan, Maria Luisa Sapino. Focusing Decomposition
Accuracy by Personalizing Tensor Decomposition (PTD). ACM International Conference on
Information and Knowledge Management (CIKM'14). Shanghai, China. 2014.
• Mijung Kim and K. Selcuk Candan. Pushing-Down Tensor Decompositions over Unions to Promote
Reuse of Materialized Decompositions. The European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD'14). Nancy, France.
2014.
• Shengyu Huang, Xinsheng Li, K. Selcuk Candan, Maria Luisa Sapino. “Can you really trust that
seed?”: Reducing the Impact of Seed Noise in Personalized PageRank. International Conference
on Advances in Social Network Analysis and Mining (ASONAM). Beijing, China. 2014
• Parth Nagarkar and K. Selcuk Candan. HCS: Hierarchical Cut Selection for Efficiently Processing
Queries on Data Columns using Hierarchical Bitmap Indices. EDBT'14: pp. 271-282, 2014.
41. http://cascaderesearch.org
Relevant Publications
• Xiaolan Wang, K. Selcuk Candan, and Maria Luisa Sapino. Leveraging Metadata for Identifying
Local, Robust Multi-variate Temporal (RMT) Features. accepted to ICDE 2014
• Claudio Schifanella, K. Selcuk Candan, and Maria Luisa Sapino. Multiresolution Tensor
Decompositions with Mode Hierarchies. ACM Transactions on Knowledge Discovery
from Data (TKDD), 8(2), June 2014.
• Jung W. Kim, K. Selcuk Candan, and M. L. Sapino. LR-PPR: Locality-Sensitive, Re-use Promoting,
Approximate Personalized PageRank Computation. CIKM'13, 2013.
• Mithila Nagendra and K. Selcuk Candan. Layered Processing of Skyline-Window-Join (SWJ)
Queries using Iteration-Fabric. ICDE'13, pp. 985-996, 2013.
• Mithila Nagendra and K. Selcuk Candan. SkySuite: A Framework of Skyline Join Operators for
Static and Stream Environments. VLDB'13, 2013.
• Jung Hyun Kim, Xilun Chen, K. Selcuk Candan, and Maria Luisa Sapino. Hive Open Research
Network Platform, at EDBT'13, pp. 985-996, 2013.
• Mijung Kim, K. Selçuk Candan: SBV-Cut: Vertex-cut based graph partitioning using structural
balance vertices. Data Knowl. Eng. 72: 285-303 (2012)
• Claudio Schifanella, Maria Luisa Sapino, K. Selçuk Candan: On context-aware co-clustering with
metadata support. J. Intell. Inf. Syst. 38(1): 209-239 (2012)
42. http://cascaderesearch.org
Relevant Publications
• K. Selçuk Candan, Rosaria Rossini, Maria Luisa Sapino, Xiaolan Wang: sDTW: Computing DTW
Distances using Locally Relevant Constraints based on Salient Feature Alignments. PVLDB 5(11):
1519-1530 (2012)
• Mijung Kim, K. Selçuk Candan: Decomposition-by-normalization (DBN): leveraging approximate
functional dependencies for efficient Tensor Decomposition. CIKM 2012: 355-364
• Jung Hyun Kim, K. Selçuk Candan, Maria Luisa Sapino: Impact Neighborhood Indexing (INI) in
diffusion graphs. CIKM 2012: 2184-2188
• K. Selçuk Candan, Rosaria Rossini, Maria Luisa Sapino, Xiaolan Wang: STFMap: query- and
feature-driven visualization of large time series data sets. CIKM 2012: 2743-2745
• Mithila Nagendra, K. Selçuk Candan: Skyline-sensitive joins with LR-pruning. EDBT 2012: 252-263
• Songling Liu, Juan P. Cedeño, K. Selçuk Candan, Maria Luisa Sapino, Shengyu Huang, Xinsheng
Li: R2DB: A System for Querying and Visualizing Weighted RDF Graphs. ICDE 2012: 1313-1316.