The volume, variety, and high availability of data backing decision support systems have affected business intelligence, the discipline providing strategies to transform raw data into decision-making insights. Such transformation is usually abstracted in the “knowledge pyramid,” where data collected from the real world are processed into meaningful patterns. In this context, volume, variety, and data availability have opened up challenges in augmenting the knowledge pyramid. On the one hand, the volume and variety of unconventional data (i.e., unstructured non-relational data generated by heterogeneous sources such as sensor networks) demand novel and type-specific data management, integration, and analysis techniques. On the other hand, the high availability of unconventional data is increasingly attracting data scientists with high competence in the business domain but low competence in computer science and data engineering; enabling their effective participation requires investigating new paradigms to drive and ease knowledge extraction. The goal of this thesis is to augment the knowledge pyramid from two points of view, namely, by including unconventional data and by providing advanced analytics. As to unconventional data, we focus on mobility data and on the related privacy issues by providing (de-)anonymization models. As to analytics, we introduce a higher abstraction level than writing formal queries. Specifically, we design advanced techniques that allow data scientists to explore data either by expressing intentions or by interacting with smart assistants in hands-free scenarios.
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data and advanced analytics
1. PhD Computer Science and Engineering
[Figure: the knowledge pyramid, with levels World, Data, Information, Knowledge, Wisdom]
Augmenting the Knowledge Pyramid
with Unconventional Data & Advanced Analytics
Matteo Francia
Supervisor: Prof. Matteo Golfarelli
Ciclo XXXIII
2. PhD Computer Science and Engineering
Outline
The knowledge pyramid
Augmenting the knowledge pyramid
Part I: Unconventional data
Part II: Advanced analytics
Advanced analytics in hands-free scenarios
Conclusion
Matteo Francia – University of Bologna 2
3. PhD Computer Science and Engineering
BI & the knowledge pyramid
Business intelligence
Strategies to transform raw data into decision-making insights
Transformation is usually abstracted in the “knowledge pyramid” [1, 2]
Data: symbols representing real-world objects (e.g., store product sales)
Information: processed data (e.g., query the product with highest profit)
Knowledge: understanding (e.g., mine products often sold together)
Wisdom: knowledge in action (e.g., discount products to optimize profits)
Contribution: augmenting the knowledge pyramid
PART I: unconventional data to improve decision-making
PART II: advanced analytics to climb the pyramid
[1] Jennifer E. Rowley: The wisdom hierarchy: representations of the DIKW hierarchy. J. Inf. Sci. 33(2): 163-180 (2007)
[2] Martin Frické: The knowledge pyramid: a critique of the DIKW hierarchy. J. Inf. Sci. 35(2): 131-142 (2009)
[Figure: the knowledge pyramid and the systems supporting each level]
World
Data (operational DB, OLTP)
Information (data warehouse, OLAP)
Knowledge (data mining)
Wisdom (decisions)
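The climb from Data to Information to Knowledge can be illustrated with a minimal sketch on the slide's own examples (the product with the highest profit, products often sold together). The records, the function names, and the support threshold are assumptions made up for illustration; they do not come from the thesis.

```python
from collections import Counter
from itertools import combinations

# Data: raw sales records (receipt id, product, profit); values are made up.
sales = [
    (1, "bread", 1.0), (1, "butter", 2.0),
    (2, "bread", 1.0), (2, "butter", 2.0), (2, "jam", 2.5),
    (3, "jam", 2.5),
    (4, "bread", 1.0), (4, "butter", 2.0),
]

def profit_by_product(records):
    """Information: aggregate raw records into total profit per product."""
    totals = Counter()
    for _, product, profit in records:
        totals[product] += profit
    return dict(totals)

def frequent_pairs(records, min_support):
    """Knowledge: product pairs sold together in at least min_support receipts."""
    baskets = {}
    for receipt, product, _ in records:
        baskets.setdefault(receipt, set()).add(product)
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

totals = profit_by_product(sales)
best = max(totals, key=totals.get)          # product with the highest profit
pairs = frequent_pairs(sales, min_support=3)
print(best, pairs)
```

Wisdom would then be the decision taken on top of these results, e.g., discounting one product of a frequent pair; that step is left to the decision maker here.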
4. PhD Computer Science and Engineering
Part I: unconventional data
Sensing provides data to support contextual decisions
“World” and “Data” levels
New challenges on unconventional data
Unstructured and non-relational
Transformation requires type-aware techniques
[Figure: the knowledge pyramid (World; Data: operational DB, OLTP; Information: data warehouse, OLAP; Knowledge: data mining; Wisdom: decisions), with unconventional data highlighted]
5. PhD Computer Science and Engineering
Contribution: mobility data
Mobility data are at the core of location-based systems
Trajectory: temporal sequence of spatial locations
- Uncertainty: positioning errors
- E.g., GPS (~m) vs GSM (~km)
- Sensitivity: four spatio-temporal points are enough to uniquely identify 95% of individuals [1, 2]
- De-anonymize through raw signatures [3]
- De-anonymize through personal gazetteers [4]
Big data applications
- Map matching [5]: project GPS locations to the most-likely road segments
- Profiling [6]: estimate user profiles and income by frequented places
- Precision farming [7]: monitor and coordinate cropping robots
[1] Yves-Alexandre De Montjoye, et al.: Unique in the crowd: The privacy bounds of human mobility. Scientific reports 3 (2013): 1376.
[2] Fengmei Jin, Wen Hua, Matteo Francia, Pingfu Chao, Maria E. Orlowska, Xiaofang Zhou: A Survey and Experimental Study on Privacy-Preserving Trajectory Data Publishing. (Under review, TKDE)
[3] Fengmei Jin, Wen Hua, Thomas Zhou, Jiajie Xu, Matteo Francia, Maria E. Orlowska, Xiaofang Zhou: Trajectory-Based Spatiotemporal Entity Linking. IEEE Trans. on Know. and Data Eng. (2020).
[4] Matteo Francia, Enrico Gallinucci, Matteo Golfarelli, Nicola Santolini: DART: De-Anonymization of personal gazetteers through social trajectories. J. Inf. Secur. Appl. 55: 102634 (2020)
[5] Matteo Francia, Enrico Gallinucci, Federico Vitali: Map-Matching on Big Data: a Distributed and Efficient Algorithm with a Hidden Markov Model. MIPRO 2019: 1238-1243
[6] Matteo Francia, Matteo Golfarelli, Stefano Rizzi: Summarization and visualization of multi-level and multi-dimensional itemsets. Inf. Sci. 520: 63-85 (2020)
[7] Giuliano Vitali, Matteo Francia, Matteo Golfarelli, Maurizio Canavari: Crop Management with the IoT: An Interdisciplinary Survey. Agronomy 11.1 (2021): 181.
[Figure: three trajectories (Tb, Tg, Tr) drawn on a grid with columns A to D and rows 1 to 4]
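The sensitivity of trajectories can be made concrete with a small sketch: a trajectory is a temporal sequence of spatial locations, and a user is re-identifiable when a few known points match nobody else. The grid cells, hours, and trajectory contents below are made up; using the first k points (rather than random ones, as in the cited study [1]) is a simplifying assumption that keeps the example deterministic.

```python
from typing import Dict, List, Tuple

Point = Tuple[str, int]      # (grid cell, hour): a coarse, hypothetical encoding
Trajectory = List[Point]     # temporal sequence of spatial locations

def matching_users(trajectories: Dict[str, Trajectory], known: List[Point]) -> List[str]:
    """Return the users whose trajectory contains every known point."""
    return sorted(u for u, t in trajectories.items() if set(known) <= set(t))

def is_unique(trajectories: Dict[str, Trajectory], user: str, k: int) -> bool:
    """True if the first k points of the user's trajectory match no other user."""
    return matching_users(trajectories, trajectories[user][:k]) == [user]

def uniqueness(trajectories: Dict[str, Trajectory], k: int) -> float:
    """Fraction of users uniquely identified by their first k points."""
    return sum(is_unique(trajectories, u, k) for u in trajectories) / len(trajectories)

# Toy trajectories, named after the labels in the figure; contents are made up.
trajs = {
    "Tb": [("A1", 8), ("B2", 9), ("C3", 10), ("D4", 11)],
    "Tg": [("A1", 8), ("B2", 9), ("C2", 10), ("D1", 11)],
    "Tr": [("A4", 8), ("B3", 9), ("C3", 10), ("D4", 11)],
}
for k in (1, 2, 3):
    print(k, uniqueness(trajs, k))
```

In this toy data set Tb and Tg share their first two points, so only a third point tells them apart.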
6. PhD Computer Science and Engineering
Part II: advanced analytics
High availability and accessibility attract new data scientists
High competence in business domain
Low competence in computer science
Since the 1970s, data have been retrieved through relational queries
This requires understanding formal languages and DBMSs
Advanced analytics (semi-automatic transformation)
- “Information” and “Knowledge” levels
[Figure: the knowledge pyramid, with advanced analytics highlighted: intentions, hands-free scenarios, and data summaries]
7. PhD Computer Science and Engineering
Contribution: advanced analytics
Hands-free scenarios
Augmented OLAP [1]: recommendation in augmented reality
Conversational OLAP [2, 3]: interpret natural language queries
Express high-level analytic abstractions, not queries
E.g., describe [4, 5] interesting patterns of sales
E.g., assess [6] Italian sales against French sales
Data summaries
Summarization based on multidimensional similarity [7]
Conceptual model for data narratives [8, 9]
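As an illustration of what a high-level abstraction such as "assess Italian sales against French sales" might compute, here is a minimal sketch that compares a target against a benchmark and attaches a label to each comparison. The ratio-based comparison, the thresholds, the labels, and the sales figures are all assumptions; this is not the syntax or semantics of the assess operator defined in [6].

```python
from typing import Dict

def assess(target: Dict[str, float], benchmark: Dict[str, float]) -> Dict[str, str]:
    """Label each shared key by the ratio of the target value to the benchmark value.

    Assumes every benchmark value is non-zero.
    """
    labels = {}
    for key in sorted(target.keys() & benchmark.keys()):
        ratio = target[key] / benchmark[key]
        if ratio < 0.9:          # hypothetical threshold
            labels[key] = "worse"
        elif ratio > 1.1:        # hypothetical threshold
            labels[key] = "better"
        else:
            labels[key] = "fine"
    return labels

# Made-up sales by product for the two countries.
italy = {"bikes": 120.0, "helmets": 40.0, "locks": 50.0}
france = {"bikes": 100.0, "helmets": 80.0, "locks": 50.0}
print(assess(italy, france))
```

The point of the sketch is the interface: the user states the comparison to make, and the system decides how to query, compare, and label the data.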
[1] Matteo Francia, Matteo Golfarelli, Stefano Rizzi: A-BI+. A framework for Augmented Business Intelligence. Inf. Syst. 92: 101520 (2020)
[2] Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: COOL: A framework for conversational OLAP. Inf. Syst. 101752. (2021)
[3] Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Conversational OLAP in Action. EDBT 2021: 646-649
[4] Antoine Chédin, Matteo Francia, Patrick Marcel, Veronika Peralta, and Stefano Rizzi. The tell-tale cube. ADBIS, 2020.
[5] Matteo Francia, Patrick Marcel, Verónika Peralta, Stefano Rizzi: Enhancing Cubes with Models to Describe Multidimensional Data. Information Systems Frontiers (2021)
[6] Matteo Francia, Matteo Golfarelli, Patrick Marcel, Stefano Rizzi, Panos Vassiliadis: Assess Queries for Interactive Analysis of Data Cubes. EDBT 2021: 121-132
[7] Matteo Francia, Matteo Golfarelli, Stefano Rizzi: Summarization and visualization of multi-level and multi-dimensional itemsets. Inf. Sci. 520: 63-85 (2020)
[8] Faten El Outa, Matteo Francia, Patrick Marcel, Verónika Peralta, Panos Vassiliadis: Towards a Conceptual Model for Data Narratives. ER 2020: 261-270
[9] Faten El Outa, Matteo Francia, Patrick Marcel, Verónika Peralta, Panos Vassiliadis: Supporting the Generation of Data Narratives. ER Forum/Posters/Demos 2020: 168-172
8. PhD Computer Science and Engineering
Advanced analytics
Augmented OLAP
Matteo Francia, Matteo Golfarelli, Stefano Rizzi: A-BI+: A framework for Augmented Business Intelligence. Inf. Syst. 92: 101520 (2020)
Matteo Francia, Matteo Golfarelli, Stefano Rizzi: Augmented Business Intelligence. DOLAP 2019
9. PhD Computer Science and Engineering
Application scope
Enable analytics on augmented reality
E.g., an inspector analyzing production rates
Sense the context through augmented devices
E.g., smart glasses
Detect interaction and engagement [1]
Produce analytical reports
Relevant to the sensed context
Cardinality constraint
Near real-time
[1] Yu-Chuan Su, Kristen Grauman: Detecting Engagement in Egocentric Video. ECCV (5) 2016: 454-471
10. PhD Computer Science and Engineering
Data Mart: repository of multidimensional cubes
Cubes representing business facts
Data dictionary
What we can recognize (i.e., md-elements)
Context: subset of md-elements
Mappings to sets of md-elements
A-priori interest
What can we sense?
[Figure: data mart with the Sales cube (measures Quantity, Revenues) and the Assembly cube (measures AssembledItems, AssemblyTime); level labels shown include Date, Month, Year, Product, Type, Category, Family, Part, Store, and City. Other labels: Device, Dictionary.]
Example context:
<Object, Seat> dist = 1m
<Object, BikeExcite> dist = 2m
<Location, RoomA.1>
<Date, 16/10/2018>
<Role, Controller>
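The dictionary, the context, and the mappings to md-elements described on this slide can be sketched as follows. The context entries are the ones shown on the slide; the mapping table, the md-element names attached to each entry, and the distance-based weight `1 / (1 + distance)` are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class ContextEntry:
    kind: str                 # dictionary entry type, e.g. "Object" or "Role"
    value: str
    distance_m: float = 0.0   # sensed distance; 0 for non-spatial entries

# Hypothetical mappings from dictionary entries to md-elements.
MAPPINGS: Dict[Tuple[str, str], List[str]] = {
    ("Object", "Seat"): ["Part", "Type=Bike"],
    ("Object", "BikeExcite"): ["Product=BikeExcite", "Quantity"],
    ("Date", "16/10/2018"): ["Month", "AssembledItems"],
}

def project(context: List[ContextEntry]) -> Dict[str, float]:
    """Map a context to md-elements, each with a weight in (0, 1]."""
    image: Dict[str, float] = {}
    for entry in context:
        weight = 1.0 / (1.0 + entry.distance_m)   # assumed decay with distance
        for element in MAPPINGS.get((entry.kind, entry.value), []):
            image[element] = max(image.get(element, 0.0), weight)
    return image

context = [
    ContextEntry("Object", "Seat", 1.0),
    ContextEntry("Object", "BikeExcite", 2.0),
    ContextEntry("Location", "RoomA.1"),
    ContextEntry("Date", "16/10/2018"),
    ContextEntry("Role", "Controller"),
]
image = project(context)
print(image)
```

Entries without a mapping (here the location and the role) simply contribute nothing; closer objects contribute md-elements with a higher weight.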
11. PhD Computer Science and Engineering
Recommendation
Context interpretation
Given context T over the data dictionary
Project T to an image of fragments I through mappings
- Fragment: intuitively a “small” query
Add the log
Get queries with positive feedback from similar contexts
- Enrich I to I* with unperceived elements from T
Each fragment has contextual and log relevance
Query generation
Cannot directly translate I* into a well-formed query
High cardinality I* = hardly interpretable “monster query”
[Figure: recommendation pipeline. The sensed context (e.g., <Object, Seat> dist = 1m; <Object, BikeExcite> dist = 2m; <Location, RoomA.1>; <Date, 16/10/2018>; <Role, Controller>) and the log feed query generation, which produces context-relevant queries; query selection returns the recommended queries as analytical reports; the user’s feedback goes back to the log.]
12. PhD Computer Science and Engineering
Query generation
Generate queries from image I* of fragments
Each fragment is a query
Depth-first exploration with pruning rules
- Query cardinality can only increase
- Some queries are redundant
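The depth-first exploration with the two pruning rules above can be sketched as follows. This is a simplified illustration, not the thesis' algorithm: a query is modeled as a (group-by, predicates, measures) triple, fragments are combined by set union, and the cardinality is a crude estimate (product of assumed level cardinalities, ignoring predicates). The names `LEVEL_CARD`, `merge`, and `generate`, and the three fragments, are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Set

class Query(NamedTuple):
    group_by: FrozenSet[str]
    predicates: FrozenSet[str]
    measures: FrozenSet[str]

def q(group_by, predicates=(), measures=()) -> Query:
    return Query(frozenset(group_by), frozenset(predicates), frozenset(measures))

def merge(a: Query, b: Query) -> Query:
    """Combine two queries by taking the union of each component."""
    return Query(a.group_by | b.group_by, a.predicates | b.predicates, a.measures | b.measures)

# Assumed number of distinct values per level, used to estimate the result size.
LEVEL_CARD = {"Month": 12, "Product": 5, "Part": 20}

def cardinality(query: Query) -> int:
    size = 1
    for level in query.group_by:
        size *= LEVEL_CARD[level]
    return size

def generate(fragments: List[Query], max_card: int) -> List[Query]:
    seen: Set[Query] = set()
    out: List[Query] = []

    def visit(current: Query, start: int) -> None:
        if cardinality(current) > max_card:
            return  # pruning: merging more fragments cannot shrink the result
        if current in seen:
            return  # pruning: redundant query, already generated on another path
        seen.add(current)
        out.append(current)
        for i in range(start, len(fragments)):
            visit(merge(current, fragments[i]), i + 1)

    for i, fragment in enumerate(fragments):
        visit(fragment, i + 1)
    return out

fragments = [
    q({"Month"}, (), {"AssembledItems"}),
    q({"Product"}, {"Product=BikeExcite"}, {"Quantity"}),
    q({"Part"}, {"Type=Bike"}, ()),
]
result = generate(fragments, max_card=100)
for query in result:
    print(sorted(query.group_by), sorted(query.predicates), sorted(query.measures))
```

Because the estimated cardinality never decreases when a fragment is merged in, a branch can be abandoned as soon as it exceeds the constraint.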
[Figure: depth-first exploration over the fragments (labels: I*, μ(T), Fragments). Each node is a query written as {group-by set}, {selection predicates}, {measures}. Distinct nodes shown:]
{Month}, {}, {AssembledItems}
{Year}, {}, {AssembledItems}
{Product}, {(Product=BikeExcite)}, {Quantity}
{Part,Type}, {(Type=Bike)}, {}
{Part,Product}, {(Product=BikeExcite)}, {Quantity}
{Month,Product}, {(Product=BikeExcite)}, {Quantity,AssembledItems}
{Year,Product}, {(Product=BikeExcite)}, {Quantity,AssembledItems}
{Month,Part,Product}, {(Product=BikeExcite)}, {Quantity,AssembledItems}
{Year,Part,Product}, {(Product=BikeExcite)}, {Quantity,AssembledItems}
{Month,Part,Type}, {(Type=Bike)}, {AssembledItems}
{Year,Part,Type}, {(Type=Bike)}, {AssembledItems}
13. PhD Computer Science and Engineering
Query selection
Given the number of queries to return (rq), maximize the covered fragments and minimize their overlap
E.g., given two queries q and q’, their joint relevance is
rel(q) + rel(q’) - sim(q, q’) * (rel(q) + rel(q’)) / 2
Weighted Maximum Coverage Problem (NP-hard)
Greedy: iteratively pick the query maximizing relT
- Only a few queries are retrieved, so the greedy search is not expensive
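The greedy selection can be sketched as below. The two-query formula is the one on the slide; extending it to more than two queries by summing the pairwise penalties, using Jaccard similarity over covered fragments as `sim`, and the relevance values are all assumptions made for this example.

```python
from typing import Dict, FrozenSet, List

def sim(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity between the fragments covered by two queries (assumed)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def gain(candidate: str, selected: List[str],
         covers: Dict[str, FrozenSet[str]], rel: Dict[str, float]) -> float:
    """Relevance added by a candidate, discounted by its overlap with selected queries."""
    penalty = sum(
        sim(covers[candidate], covers[s]) * (rel[candidate] + rel[s]) / 2
        for s in selected
    )
    return rel[candidate] - penalty

def greedy_select(covers: Dict[str, FrozenSet[str]], rel: Dict[str, float], rq: int) -> List[str]:
    """Pick up to rq queries, each time taking the one with the highest gain."""
    selected: List[str] = []
    remaining = set(covers)
    while remaining and len(selected) < rq:
        best = max(sorted(remaining), key=lambda c: gain(c, selected, covers, rel))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical queries, the fragments they cover, and their relevance.
covers = {
    "q1": frozenset({"Month", "AssembledItems"}),
    "q2": frozenset({"Month", "AssembledItems", "Product"}),
    "q3": frozenset({"Part", "Type"}),
}
rel = {"q1": 0.9, "q2": 1.0, "q3": 0.6}
picked = greedy_select(covers, rel, rq=2)
print(picked)
```

Here q1 is more relevant than q3 on its own, but it overlaps heavily with the already selected q2, so the greedy step prefers q3.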
[Figure: two queries q and q’ covering overlapping sets of fragments within I* and μ(T)]
14. PhD Computer Science and Engineering
Test set up
Cube with 109 md-elements
Simulate user moving inside a factory
Given fixed context and query target
Assess similarity of the proposed query in similar contexts
𝛽: context similarity
sim: proposed/target query similarity
Effectiveness
[Figure: effectiveness for |T| = 12 and rq = 4, from the target context to similar contexts. Series: best query with user experience (after 2 visits: 0.95; after 4 visits: 0.98) and best query without user experience.]
15. PhD Computer Science and Engineering
Research directions
OLAP in augmented reality
Support analytical queries in hands-free scenarios
Recommend relevant data facts from a real-world context
Research directions
Provide (fast) query previews
- Estimate the execution time of each query
- Address query caching and multi-query optimization issues
Correlate context-awareness to data quality [3]
- Relevance, amount, and completeness [4]
[3] Stephanie Watts, Ganesan Shankaranarayanan, Adir Even: Data quality assessment in context: A cognitive perspective. Decis. Support Syst. 48(1): 202-211 (2009)
[4] Diane M. Strong, Yang W. Lee, Richard Y. Wang: Data Quality in Context. Commun. ACM 40(5): 103-110 (1997)
16. PhD Computer Science and Engineering
Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: COOL: A framework for conversational OLAP. Inf. Syst. 101752. (2021)
Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Conversational OLAP in Action. EDBT 2021 (best demo award): 646-649
Advanced analytics
Conversational OLAP
17. PhD Computer Science and Engineering
Motivation
Enable analytics through natural language
OLAP provides low-level operators [1]
Users need knowledge of the multidimensional model…
… or even programming skills
We introduce COOL (COnversational OLap) [3]
Translate natural language into formal queries
[1] Panos Vassiliadis, Patrick Marcel, Stefano Rizzi: Beyond roll-up's and drill-down's: An intentional analytics model to reinvent OLAP. Information Systems. (2019)
[2] Matteo Francia, Matteo Golfarelli, Stefano Rizzi: A-BI+: A framework for Augmented Business Intelligence. Information Systems. (2020)
[3] Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: COOL: A Framework for Conversational OLAP. Information Systems. (2021)
18. PhD Computer Science and Engineering
COOL: architecture
[Figure: COOL architecture, offline phase. Components: automatic KB feeding, manual KB enrichment, KB, DW. Labeled flows: metadata & values, synonyms, ontology.]
19. PhD Computer Science and Engineering
COOL: architecture
[Figure: COOL architecture, offline and online phases. Online pipeline: speech-to-text → raw text → interpretation (full query or OLAP operator) → annotated parse forest → disambiguation & enhancement → parse tree → SQL generation → SQL → execution & visualization. Example input: “Sales by Customer and Month”. Supporting components: KB, DW, log, statistics. Offline components as on the previous slide: automatic KB feeding (metadata & values, synonyms, ontology) and manual KB enrichment (synonyms).]
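To give a feel for the translation from a natural language request to SQL, here is a deliberately naive sketch based on keyword lookup in a knowledge base with synonyms. COOL itself relies on parsing, disambiguation, and enhancement steps that are not reproduced here; the KB contents, the table name `sales_fact`, the column names, and the choice of `SUM` as aggregation are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical knowledge base: term (or synonym) -> (kind, column name).
KB: Dict[str, Tuple[str, str]] = {
    "sales": ("measure", "sales"),
    "revenues": ("measure", "revenues"),
    "customer": ("level", "customer"),
    "client": ("level", "customer"),   # synonym
    "month": ("level", "month"),
}
FACT_TABLE = "sales_fact"              # hypothetical table name

def annotate(text: str) -> List[Tuple[str, str]]:
    """Keep the tokens known to the KB, each annotated with its kind and column."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [KB[t] for t in tokens if t in KB]

def to_sql(text: str) -> str:
    """Build a GROUP BY query from the recognized measures and levels."""
    annotated = annotate(text)
    measures = [column for kind, column in annotated if kind == "measure"]
    levels = [column for kind, column in annotated if kind == "level"]
    if not measures:
        raise ValueError("no measure recognized in the request")
    select = levels + [f"SUM({m}) AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select)} FROM {FACT_TABLE}"
    if levels:
        sql += f" GROUP BY {', '.join(levels)}"
    return sql

print(to_sql("Sales by Customer and Month"))
```

Unknown words are silently ignored in this sketch, which is exactly where a real system needs disambiguation and user interaction.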
21. PhD Computer Science and Engineering
Effectiveness
40 users with heterogeneous OLAP skills
Asked to translate (Italian) analytic goals into English
Users provided good feedback on the interface...
... as well as on the interpretation accuracy
OLAP familiarity   Full query               OLAP operator
                   Accuracy   Time (s)      Accuracy   Time (s)
Low                0.91       141           0.86       102
High               0.91       97            0.92       71
22. PhD Computer Science and Engineering
COOL in Action!
[3] Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Conversational OLAP in Action. EDBT (best demo award) 2021: 646-649
23. PhD Computer Science and Engineering
Research directions
COOL (Conversational OLAP)
Support the translation of a natural language conversation into an OLAP session
Analyze data without requiring technological skills
- Add conversational capabilities to Augmented OLAP
Towards an end-to-end conversational solution
Create query summaries that can be returned as short vocal messages
Identify insights out of a large amount of data
Identify the “right” storytelling and user-system interaction
24. PhD Computer Science and Engineering
Conclusion
Data scientists have heterogeneous backgrounds
The need for high-level analytic abstractions and interfaces is well-understood
Advanced analytics work towards (semi-)autonomous data transformation
Data management should be (semi-)automated as well
- Orchestrate data platforms, maintain data lineage, profile data
Unconventional mobility data
Handling trajectory variety and semantics is troublesome
- Differences in sampling rates, speed, accuracy, transportation means
- We need a unifying framework for storage and analysis
Privacy of spatio-temporal data is a concern
- Besides protection, we need scalable solutions
25. PhD Computer Science and Engineering
Publications
Journal articles
1. Matteo Francia, Patrick Marcel, Verónika Peralta, Stefano Rizzi: Enhancing Cubes with Models to Describe Multidimensional Data. Information Systems Frontiers (2021)
2. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: COOL: A framework for conversational OLAP. Information Systems (2021)
3. Giuliano Vitali, Matteo Francia, Matteo Golfarelli, Maurizio Canavari: Crop Management with the IoT: An Interdisciplinary Survey. Agronomy (2021)
4. Fengmei Jin, Wen Hua, Thomas Zhou, Jiajie Xu, Matteo Francia, Maria E. Orlowska, Xiaofang Zhou: Trajectory-Based Spatiotemporal Entity Linking. IEEE Transactions on Knowledge and Data Engineering (2020)
5. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli, Nicola Santolini: DART: De-Anonymization of personal gazetteers through social trajectories. Journal of Information Security and Applications 55: 102634 (2020)
6. Matteo Francia, Matteo Golfarelli, Stefano Rizzi: A-BI+: A framework for Augmented Business Intelligence. Information Systems 92: 101520 (2020)
7. Matteo Francia, Matteo Golfarelli, Stefano Rizzi: Summarization and visualization of multi-level and multi-dimensional itemsets. Information Sciences 520: 63-85 (2020)
8. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Social BI to understand the debate on vaccines on the Web and social media: unraveling the anti-, free, and pro-vax communities in Italy. Social Network Analysis and Mining 9(1): 46:1-46:16 (2019)
Conference papers
1. Matteo Francia, Matteo Golfarelli, Patrick Marcel, Stefano Rizzi, Panos Vassiliadis: Assess Queries for Interactive Analysis of Data Cubes. EDBT 2021: 121-132
2. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Conversational OLAP in Action. EDBT 2021: 646-649 (best demo award)
3. Antoine Chédin, Matteo Francia, Patrick Marcel, Verónika Peralta, Stefano Rizzi: The Tell-Tale Cube. ADBIS 2020: 204-218
4. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Towards Conversational OLAP. DOLAP 2020: 6-15
5. Faten El Outa, Matteo Francia, Patrick Marcel, Verónika Peralta, Panos Vassiliadis: Supporting the Generation of Data Narratives. ER Forum/Posters/Demos 2020: 168-172
6. Faten El Outa, Matteo Francia, Patrick Marcel, Verónika Peralta, Panos Vassiliadis: Towards a Conceptual Model for Data Narratives. ER 2020: 261-270
7. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli, Stefano Rizzi: OLAP Querying of Document Stores in the Presence of Schema Variety. SEBD 2020: 128-135
8. Matteo Francia, Matteo Golfarelli, Stefano Rizzi: Augmented Business Intelligence. DOLAP 2019
9. Matteo Francia, Enrico Gallinucci, Federico Vitali: Map-Matching on Big Data: a Distributed and Efficient Algorithm with a Hidden Markov Model. MIPRO 2019: 1238-1243
10. Matteo Francia, Matteo Golfarelli, Stefano Rizzi: A Similarity Function for Multi-Level and Multi-Dimensional Itemsets. SEBD 2018
11. Matteo Francia, Danilo Pianini, Jacob Beal, Mirko Viroli: Towards a Foundational API for Resilient Distributed Systems Design. FAS*W@SASO/ICCAC 2017: 27-32
26. PhD Computer Science and Engineering
Thank you.
Questions?
Editor's notes
insight -> intuizione
decision making -> processo decisionale
Yves-Alexandre De Montjoye, et al.: Unique in the crowd: The privacy bounds of human mobility. Scientific reports 3 (2013): 1376.
We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly and with a spatial resolution equal to that given by the carrier's antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals.
Systems such as Amazon's recommender system can also use contextual data (e.g., location); however, they differ from ours both in "method"/"framework" and in "recommendation".
"Method"
- Amazon relies on more "historical" evidence, whereas we interpret and mix a real-time context made of several interesting objects detected (and/or engaged) by the system
- Our system is end-to-end, i.e., it also covers the management and linking of the data used to build the queries
"Recommendation": formally, we use a hybrid approach (whereas classic approaches are item-based or collaborative)
- Mix of real-time and historical knowledge: we are not strictly log-based (i.e., the context helps us with the cold-start problem), whereas Amazon's suggestion is "other users also bought/viewed…"
- Cardinality of the result chosen to fit an augmented device
- Diversification across different queries, not within a single query
DIFF [17]: returns the tuples that maximize the difference between cells of a cube given as input
Profile user exploration to recommend unvisited parts of the cube
The RELAX operator verifies whether a pattern observed at a certain level of detail is present at a coarser level of detail too [19]
Alternative operators have also been proposed in the Cinecubes method [7,8]. The goal of this effort is to facilitate automated reporting, given an original OLAP query as input. To achieve this purpose, two operators (expressed as acts) are proposed, namely, (a) put-in-context, i.e., compare the result of the original query to query results over similar, sibling values; and (b) give-details, where drill-downs of the original query’s groupers are performed.
Jagadish: The linguistic parse trees in our system are dependency parse trees, in which each node is a word/phrase specified by the user while each edge is a linguistic dependency relationship between two words/phrases. The