Full thesis (open access): http://openarchive.univpm.it/jspui/handle/123456789/494
Abstract
-------------
Knowledge Discovery in Databases (KDD), as well as scientific experimentation in e-Science, is a complex and computationally intensive process aimed at gaining knowledge from a huge set of data. Often performed in distributed settings, KDD projects usually involve a deep interaction among heterogeneous tools and several users with specific expertise. Given the high complexity of the process, such users need effective support to achieve their goal of knowledge extraction.
This work presents the Knowledge Discovery in Databases Virtual Mart (KDDVM), a user- and knowledge-centric framework aimed at supporting the design of KDD processes in a highly distributed and collaborative scenario, in which computational resources and actors dynamically interoperate to share and elaborate knowledge. The contribution of the work is two-fold. Firstly, a conceptual systematization of the relevant knowledge is provided, with the aim of formalizing, through semantic technologies, each element taking part in the design and execution of a KDD process, including computational resources, data and actors. Secondly, we propose an implementation of the framework as an open, modular and extensible Service-Oriented platform, in which several services are available both to perform basic data manipulation operations and to support more advanced functionalities, among them the management of deployment and activation of computational resources, service discovery, and service composition to build KDD processes. Since the cooperative design and execution of a distributed KDD process typically require several skills, both technical and managerial, collaboration can easily become a source of complexity if it is not supported by some form of coordination. For this reason, a set of platform functionalities is specifically aimed at supporting collaboration within a distributed team, by providing an environment in which users can work on the same project and share processes, results and ideas.
KDD Process Design in Collaborative and Distributed Environments
1. Università Politecnica delle Marche
Scuola di dottorato in Scienze dell’Ingegneria
Curriculum in Ingegneria Informatica, Gestionale e dell’Automazione
KDD Process Design in Collaborative
and Distributed Environments
Emanuele Storti
Advisor: Prof.ssa Claudia Diamantini
Curriculum supervisor: Prof. Sauro Longhi
28 February 2012
2. Summary
I. Introduction
• Background & Motivation
• Related Work
• Research Question
II. Approach
III. Knowledge Layer
IV. Platform
• Service Discovery
• Process Composition
• Collaboration features
• Formal Verification of processes
V. Conclusion
10/12/12 Emanuele Storti KDD Process Design in Collaborative and Distributed Environments
3. Introduction
“Knowledge is the common wealth of humanity”*
Access to vast collections of data is the key to
understanding and responding to complex problems
Growing capability in data production: Data Deluge
• availability of cost-effective technologies
• enhancement of communication infrastructures
Not only organizations, also e-Science
• global problems ask for data-intensive computation (genomics, physics, climatology)
• global effort through world-wide scientific collaboration
Need for tools supporting distribution and collaboration in data analysis
* A. Samassekou, UN World Summit on the Information Society
4. Background
Knowledge Discovery in Databases (KDD)
Process of identifying valid, novel, potentially useful patterns in data
[Fayyad, 1996]
• knowledge refinement
• many steps, several iterations
• strong user interaction
• need for user knowledge
Support to:
• decision making in organizations
• e-Science experimentation
5. Related Work
Early proposals (Intelligent Data Analysis systems):
• local frameworks
• single-user
• predefined set of tools (little extensibility)
Distribution of tools & computational aspects:
• SOA [Ali, 2005] [Kumar, 2005]
• Grid [Kgrid, 2003]
• OGSA: SOA + Grid
Distribution of users & evolution of organizations:
• user support in advanced activities [Bernstein, 2005] [Žáková, 2011]
• collaboration [myExperiment, 2009]
6. Motivation
Main issues in open, distributed and collaborative scenarios
• Localization of many available distributed tools
• Integration of heterogeneous tools
(interfaces, programming languages, OSs, transfer protocols,…)
• Complexity in their usage
(data preparation, I/O interpretation, precondition satisfaction,
process design), many possible combinations
• Coordination in cooperative work (source of complexity)
7. Research Question
How to support a community of users in the design
of a KDD project in an open, heterogeneous,
distributed and collaborative scenario?
1. Which kind of support and functionalities should a platform offer?
2. Which principles should the platform be based on?
3. Which resources are involved in a KDD process and how to represent them?
8. Functional Requirements
1. Support functionalities that should be offered by the
platform:
• importing new heterogeneous tools for several KDD tasks
• retrieving tools useful for certain purposes
• designing a process by connecting tools together
• understanding whether such connections are meaningful
• suggesting tool sequences that are effective for a given goal
• executing the process
• supporting collaboration throughout the design process
• co-operative design
• communication among team members
9. Non-functional Requirements
2. Principles on which the platform should be based
Requirements
• Interoperability
• Flexibility
• Modularity
• Reusability
• Transparency
• Usability
Main issues
• heterogeneity
• complexity
• distribution
• coordination
10. Resources in a KDD project
3. Identification of resources involved in a KDD process
• Computational units (gathering, preparation, modeling, visualization, …)
• Data and Models (dataset, input parameters, intermediate results, final model)
• Actors (domain experts, DB/DWH administrators, DM and KDD experts)
• Computational processes
11. Knowledge- and user-centric approach
Service-Oriented platform
Basic Services: services for every KDD phase
• tools are wrapped as services,
• deployed on a server,
• published in a common repository
Support Services
• back-end functionalities
• advanced functionalities
Collaborative functionalities
• co-operation
• communication facilities
Knowledge Layer
Data model: systematization of knowledge regarding each KDD resource, from both a theoretical and an operational perspective
12. Knowledge Layer
Methodology
• degrees of abstraction for description of resources
• semantic technologies for their representation
• formal ontology building methodology [Fernandez, 1997] [Staab, 2001]
• quality requirements [Gruber, 1995]
Three degrees of abstraction:
• Conceptual: abstract representation (domain ontologies)
• Concrete: actual resources of the platform (semantically annotated XML descriptors)
• Execution: logs and traces
13. Knowledge Layer > Computational units
[IDA 2009] [ECML 2009]
KDDONTO
[Diagram: KDDONTO concepts Method, Algorithm, Data, Task, Phase and Performance Index; an algorithm is implemented as a tool, which runs as a service]
• algorithms arranged in a taxonomy
• method, task, KDD phase, performances
• data arranged in a taxonomy
• algorithms’ interfaces (I/O, pre/post-conditions)
• …
Implemented in OWL-DL (ALCOIF expressivity)
14. Knowledge Layer > Computational units
eSAWSDL
SAWSDL: Semantic Annotations for WSDL (W3C)
• location (<xmls:impl>)
• I/O <message>: type, syntax
Further extensions:
• implemented <algorithm>
• <performance>
• <author>
• …
Descriptors are stored in a UDDI registry
[Diagram: an algorithm is implemented as a tool, which runs as a service; each service is described by an eSAWSDL descriptor (<location>, <I/O>, <algorithm>, <performance>, <author>) and registered in UDDI with its algorithm, WSDL URL and QoS value]
15. Knowledge Layer > Computational units
Mappings between abstraction levels
• KDDONTO provides a shared vocabulary which services refer to
• such mappings can support process composition
[Diagram: two service descriptors, eSAWSDL1 and eSAWSDL2, are annotated with KDDONTO concepts (Remove missing values, Labeled Dataset, C4.5) through their <algorithm>, <output> and <input> elements]
16. Knowledge Layer > Actors
TeamONTO is a formal ontology devoted to representing details of actors [CTS 2012]
[Diagram: TeamONTO classes Person, Algorithm, WebService, Domain, Publication, Organization and Project, linked by the relations hasSkill, writes, worksIn, memberOf and about]
17. Knowledge Layer > Processes
XML-descriptor for concrete level processes
• <services> included in the process
• <connection> among their I/O interfaces
• <users> in charge of setting service parameters
• metadata: creation <date>/<time>, <author>, <comments>
Process descriptors are stored in a Process Repository
[Diagram: a project made of four connected service descriptors (eSAWSDL 1 to eSAWSDL 4), each with <location>, <I/O>, <algorithm>, <performance> and <author>, stored in the Process Repository (ProcRep) together with its index]
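The element names listed for the process descriptor can be made concrete with a small sketch. The slides do not give the actual schema, so the root element `process`, the child elements `service` and `user`, and every attribute name below are assumptions chosen for illustration only.

```python
import xml.etree.ElementTree as ET


def build_process_descriptor(services, connections, users, author, date, time, comments=""):
    """Build a process descriptor as an XML string (hypothetical schema)."""
    root = ET.Element("process")

    # <services> included in the process
    services_el = ET.SubElement(root, "services")
    for name in services:
        ET.SubElement(services_el, "service", {"ref": name})

    # <connection> among the I/O interfaces of the services
    for source, output, target, input_ in connections:
        ET.SubElement(
            root,
            "connection",
            {"from": source, "output": output, "to": target, "input": input_},
        )

    # <users> in charge of setting service parameters
    users_el = ET.SubElement(root, "users")
    for user, service in users.items():
        ET.SubElement(users_el, "user", {"name": user, "service": service})

    # metadata: creation date/time, author, comments
    for tag, value in (("date", date), ("time", time), ("author", author), ("comments", comments)):
        ET.SubElement(root, tag).text = value

    return ET.tostring(root, encoding="unicode")


# Illustrative values only; none of them come from the original platform.
descriptor = build_process_descriptor(
    services=["eSAWSDL1", "eSAWSDL2"],
    connections=[("eSAWSDL1", "cleanedData", "eSAWSDL2", "trainingSet")],
    users={"alice": "eSAWSDL2"},
    author="alice",
    date="2012-02-28",
    time="10:00",
    comments="first draft",
)
print(descriptor)
```

A descriptor of this kind is what a Process Repository would store and version; the real platform may organize the same information differently.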
18. Knowledge Layer > Processes
Process Repository indexing method [SEBD 2011][SAC 2012]
Application of graph-based hierarchical clustering [Jonyer, 2002]
• extracts the most frequent and common subprocesses
• arranges them in a lattice
[Diagram: hierarchical clustering of the Process Repository (ProcRep) produces the index]
19. Knowledge Layer > Processes
Support for process retrieval: subgraph isomorphism on the index is more efficient than on the whole repository
[Diagram: graph matching between the user query and the ProcRep index]
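The retrieval idea can be illustrated with a minimal sketch, assuming that each index entry pairs a frequent subprocess with the identifiers of the processes that contain it. The graph encoding, the function names and the process identifiers are hypothetical, and the brute-force matcher is only suitable for very small graphs; it is not the algorithm used in the cited work.

```python
from itertools import permutations


def is_subgraph(query, target):
    """Brute-force check that `query` occurs in `target`.

    A graph is a pair (labels, edges): `labels` maps node ids to service
    names and `edges` is a set of (source id, target id) pairs.
    """
    q_labels, q_edges = query
    t_labels, t_edges = target
    q_ids = list(q_labels)
    # Try every injective assignment of query nodes to target nodes.
    for candidate in permutations(t_labels, len(q_ids)):
        mapping = dict(zip(q_ids, candidate))
        labels_ok = all(q_labels[q] == t_labels[mapping[q]] for q in q_ids)
        edges_ok = all((mapping[a], mapping[b]) in t_edges for a, b in q_edges)
        if labels_ok and edges_ok:
            return True
    return False


def retrieve(query, index):
    """Return the ids of processes filed under index entries that contain `query`."""
    hits = set()
    for subprocess, process_ids in index:
        if is_subgraph(query, subprocess):
            hits |= process_ids
    return hits


# Toy index with one frequent subprocess shared by two (hypothetical) processes.
cleaning_then_tree = (
    {"a": "Remove missing values", "b": "C4.5"},
    {("a", "b")},
)
INDEX = [(cleaning_then_tree, {"process-1", "process-2"})]

query = ({"x": "Remove missing values", "y": "C4.5"}, {("x", "y")})
print(retrieve(query, INDEX))
```

Matching against a few small index entries avoids running the expensive check on every stored process. In this simplified form a query that matches no index entry would still need a fallback search on the repository.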
20. Knowledge Layer > Overview
[Diagram: overview of the Knowledge Layer. The ontologies KDDONTO and TeamONTO (concepts shown: Algorithm, Performance, Data, WebService, Person, Project) sit above the eSAWSDL service descriptors registered in UDDI and the process descriptors stored in the Process Repository (ProcRep) with its index]
Advantages: loose-coupling, modularity, reusability, support to advanced functions
21. KDDVM Platform
Service-oriented platform for KDD process design
[CTS 2010][CTS 2011][ISF]
[Diagram: platform architecture, showing the collaborative functionalities and the support services]
22. KDDVM Platform > KDDDesigner
• integration point of other support services
• platform front-end
23. Service Discovery
Retrieval of KDD services satisfying user requirements
• syntactic service discovery (service name)
• semantic service discovery (functionalities, goal, interface,…)
1. Search, in KDDONTO, for algorithms satisfying the requirements
2. Search, inside UDDI, for services implementing such algorithms
[Diagram: algorithms matching the requested data and goal are selected in KDDONTO; the services implementing them are then looked up in UDDI (algorithm, WSDL URL, QoS value) and described by their eSAWSDL descriptors]
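The two-step semantic discovery can be sketched as follows. The dictionaries below are toy stand-ins for KDDONTO and for the UDDI registry; the property names, service names and URLs are assumptions made for this example, not part of the platform.

```python
# Toy stand-in for KDDONTO: algorithm name -> properties (hypothetical).
ONTOLOGY = {
    "C4.5": {"task": "classification", "input": "LabeledDataset"},
    "KMeans": {"task": "clustering", "input": "UnlabeledDataset"},
}

# Toy stand-in for the UDDI registry: one record per published service.
REGISTRY = [
    {"name": "c45-service", "algorithm": "C4.5", "wsdl": "http://example.org/c45?wsdl"},
    {"name": "kmeans-service", "algorithm": "KMeans", "wsdl": "http://example.org/kmeans?wsdl"},
    {"name": "c45-fast", "algorithm": "C4.5", "wsdl": "http://example.org/c45fast?wsdl"},
]


def find_algorithms(ontology, **requirements):
    """Step 1: algorithms whose properties satisfy every requirement."""
    return [
        name
        for name, properties in ontology.items()
        if all(properties.get(key) == value for key, value in requirements.items())
    ]


def find_services(registry, algorithms):
    """Step 2: services that implement one of the selected algorithms."""
    wanted = set(algorithms)
    return [service for service in registry if service["algorithm"] in wanted]


def discover(ontology, registry, **requirements):
    """Semantic discovery: ontology lookup followed by registry lookup."""
    return find_services(registry, find_algorithms(ontology, **requirements))


for service in discover(ONTOLOGY, REGISTRY, task="classification"):
    print(service["name"], service["wsdl"])
```

Keeping the two steps separate mirrors the split between the conceptual level (what an algorithm does) and the concrete level (which services implement it).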
25. Process Composition
Verification of I/O data compatibility
26. Process Composition: matchmaking
Checking the validity of the match [IDA 2009] [ITAIS 2009] [ECAI 2010]
• syntactic compatibility: comparison between service descriptors
[Diagram: the <I/O> elements of two service descriptors, eSAWSDL1 and eSAWSDL2, are compared: same format? same syntax?]
27. Process Composition: matchmaking
• semantic compatibility: comparison between the ontological annotations of the services (kind of match between I/O, preconditions/postconditions, ...)
[Diagram: the KDDONTO concepts data1 and data2, annotating the I/O of eSAWSDL1 and eSAWSDL2, are compared: same concept? subconcept? part-of concept?]
Output: match cost
Match evaluation function:
• kind of match
• preconditions
• etc.
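A match evaluation function of this kind can be sketched as below. The slides name the kinds of match but not their direction or weights, so the following are assumptions: the output of the first service may be a subconcept of the input required by the second, or may contain the required input as a part, and the cost grows with the number of hops. The concepts `Centroids` and `ClusteringModel` and the weights are hypothetical.

```python
# Toy fragments of an ontology (hypothetical): child -> parent.
IS_A = {"LabeledDataset": "Dataset", "UnlabeledDataset": "Dataset"}
PART_OF = {"Centroids": "ClusteringModel"}

# Hypothetical cost per hop for each kind of match.
SUBCONCEPT_COST = 1.0
PART_OF_COST = 2.0


def steps_up(concept, target, relation):
    """Number of `relation` hops from `concept` up to `target`, or None if unreachable."""
    steps = 0
    while concept is not None:
        if concept == target:
            return steps
        concept = relation.get(concept)
        steps += 1
    return None


def match(output_concept, input_concept):
    """Return (kind of match, cost), or None when the two interfaces do not match."""
    if output_concept == input_concept:
        return ("exact", 0.0)
    # The produced data is a specialization of the required data.
    hops = steps_up(output_concept, input_concept, IS_A)
    if hops is not None:
        return ("subconcept", SUBCONCEPT_COST * hops)
    # The required data is a part of the produced data.
    hops = steps_up(input_concept, output_concept, PART_OF)
    if hops is not None:
        return ("part-of", PART_OF_COST * hops)
    return None


print(match("LabeledDataset", "LabeledDataset"))
print(match("LabeledDataset", "Dataset"))
print(match("ClusteringModel", "Centroids"))
print(match("Dataset", "LabeledDataset"))
```

A fuller version would also account for preconditions and postconditions, which the slides list among the inputs of the evaluation function.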
28. Process Composition: matchmaking
Usage of Matchmaking:
• provide the user with information about the validity of the match
• find all services compatible with a given one
• support an advanced functionality for semi-automatic process
composition
29. Process Composition: semi-automatic procedure
Procedure to generate conceptual processes [IDA 2009] [ITAIS 2009] [ECAI 2010]
• planning technique: goal-driven, backwards strategy
• basic step based on matchmaking
• pruning criteria, stop criteria, user constraints
• results: abstract processes useful as templates, ranked according to a metric
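The goal-driven, backwards strategy can be sketched in a much simplified form, assuming that every algorithm has exactly one input and one output and that matchmaking reduces to concept equality. The algorithm table, the depth limit used as a pruning criterion and the ranking by process length are all assumptions made for illustration.

```python
# Hypothetical algorithm interfaces: name -> (input concept, output concept).
ALGORITHMS = {
    "RemoveMissingValues": ("RawDataset", "LabeledDataset"),
    "C4.5": ("LabeledDataset", "DecisionTree"),
    "ID3": ("LabeledDataset", "DecisionTree"),
}


def compose(goal, available, algorithms, max_depth=4):
    """Backward search from `goal` to the `available` data.

    Returns candidate processes as lists of algorithm names in execution
    order, shortest first.
    """
    results = []

    def expand(needed, chain):
        if needed in available:  # stop criterion: the needed data is already there
            results.append(list(chain))
            return
        if len(chain) >= max_depth:  # pruning criterion: process too long
            return
        for name, (input_concept, output_concept) in algorithms.items():
            # Basic step: pick algorithms whose output provides what is needed.
            if output_concept == needed and name not in chain:
                expand(input_concept, [name] + chain)

    expand(goal, [])
    return sorted(results, key=len)


for process in compose("DecisionTree", {"RawDataset"}, ALGORITHMS):
    print(" -> ".join(process))
```

The resulting chains are abstract processes in the sense of the slide: templates at the conceptual level that still have to be bound to concrete services.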
30. Collaborative Features
Collaborative features for process design within a team [CTS 2011]
1. Team building: to find users with certain skills & build a team
2. Multi-user Process design and Versioning
3. Communication
4. Task assignment
Team building: retrieval of users with certain competencies inside TeamONTO
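Skill-based retrieval can be illustrated with a few triples using the TeamONTO relation hasSkill. The people, skills and project below are invented, and the greedy selection of team members is an assumption of this sketch rather than the method described in [CTS 2012].

```python
# Toy facts in the style of TeamONTO (all individuals are hypothetical).
TRIPLES = [
    ("alice", "hasSkill", "C4.5"),
    ("alice", "hasSkill", "Healthcare"),
    ("bob", "hasSkill", "C4.5"),
    ("bob", "memberOf", "ProjectX"),
    ("carol", "hasSkill", "KMeans"),
]


def skills_of(triples):
    """Map each person to the set of skills stated through hasSkill."""
    skills = {}
    for subject, predicate, obj in triples:
        if predicate == "hasSkill":
            skills.setdefault(subject, set()).add(obj)
    return skills


def build_team(triples, required):
    """Greedily pick the people who cover the most still-missing skills.

    Returns the team and the set of skills nobody covers.
    """
    skills = skills_of(triples)
    uncovered, team = set(required), []
    while uncovered:
        # Ties are broken alphabetically because the names are sorted first.
        best = max(sorted(skills), key=lambda p: len(skills[p] & uncovered), default=None)
        if best is None or not skills[best] & uncovered:
            break
        team.append(best)
        uncovered -= skills[best]
    return team, uncovered


team, missing = build_team(TRIPLES, {"C4.5", "Healthcare", "KMeans"})
print(team, missing)
```

In TeamONTO a skill may refer to an algorithm, a web service or a domain, which is why the example mixes `C4.5` with `Healthcare`.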
31. Collaborative Features
Multi-user process design and versioning: asynchronous editing of a process, and its storage as a new version in the repository
32. Collaborative Features
Communication: talk page and comments attached to versions
33. Collaborative Features
Task assignment: the parameter setting for a service execution is assigned to a user of the team
34. Formal Verification of Processes
1. Process representation through REO (CWI, Amsterdam) [Arbab, 2004]
• language for representation of interaction among services
• formal and compositional semantics
2. Specification of the desired behavior through mCRL2 [Groote, 2006]
3. Model checking to verify whether the process satisfies the behavior
36. Formal Verification of Processes
Purposes [SEBD 2010]
• formal verification of KDD processes at design-time
• reuse of typical pre-verified KDD subprocesses
37. Conclusion
KDD process design in distributed and collaborative environments
• Distributed and heterogeneous settings:
• systematic use of semantic information throughout the process
• loosely-coupled service architecture
• Community-centered approach:
• support functionalities
• users with various degrees of experience
• General-purpose attitude:
• domain independence
• generalization to experimental processes in e-Science
38. References
[Fayyad, 1996] “From data mining to knowledge discovery: an overview”, American
Association for Artificial Intelligence.
[Ali, 2005] “Web services composition for distributed data mining”, Parallel Processing
[Kgrid, 2003] “The knowledge grid”, Comm. ACM
[Bernstein, 2005] “Towards Intelligent Assistance for a Data Mining Process: An
Ontology Based Approach for Cost-Sensitive Classification”, IEEE TKDE
[Žáková, 2011] “Automating Knowledge Discovery Workflow Composition Through
Ontology-Based Planning”, IEEE T. Automation Science and Engineering
[myExperiment, 2009] “The Design and Realisation of the Virtual Research Environment
for Social Sharing of Workflows”, FGCS
[Fernandez, 1997] “METHONTOLOGY: from Ontological Art towards Ontological
Engineering”, Proc. of the AAAI97 Spring Symposium.
[Staab, 2001] “Knowledge Processes and Ontologies”, IEEE Intelligent Systems.
[Gruber, 1995] “Toward principles for the design of ontologies used for knowledge
sharing”, Int. J. Hum.-Comput. Stud.
[Jonyer, 2002] “Graph-based hierarchical conceptual clustering”, J. Mach. Learn. Res.
[Groote, 2006] “The formal specification language mCRL2”, in MMOSS.
[Arbab, 2004] “Reo: a channel-based coordination model for component composition”,
Mathematical Structures in Comp. Sci.
39. Publications
[IDA 2009] “Ontology-driven KDD Process Composition”. LNCS, Springer, 2009.
[ECML 2009] “KDDONTO: an Ontology for Discovery and Composition of KDD
Algorithms”. ECML/PKDD09 Workshop.
[ITAIS 2009] “Automatic Definition of KDD Prototype Process by Composition”.
“Management of Interconnected World”, Springer.
[CTS 2010] “Semantic-Driven Design and Management of KDD Processes”. CTS 2010,
IEEE.
[SEBD 2010] “Towards Coordination Patterns for Complex Experimentations in Data
Mining”. SEBD 2010.
[ECAI 2010] “Supporting Users in KDD Process Design. A Semantic Similarity Matching
Approach”. Planning To Learn Workshop in ECAI2010.
[CTS 2011] “Semantic-Aided Designer for Knowledge Discovery”. CTS 2011, IEEE.
[SEBD 2011] “Clustering of process schema by graph mining techniques”. SEBD 2011.
[SAC 2012] “Mining Usage Patterns from a Repository of Scientific Workflows”.
SAC2012, ACM.
[CTS 2012] “Semantically-supported Team Building in a KDD Virtual Environment”. CTS
2012, IEEE.
[ISF] “A Virtual Mart for Knowledge Discovery in Databases”. Information Systems
Frontiers, Springer, accepted.