Università Politecnica delle Marche
         Scuola di dottorato in Scienze dell’Ingegneria
         Curriculum in Ingegneria Informatica, Gestionale e dell’Automazione




      KDD Process Design in Collaborative
            and Distributed Environments
                          Emanuele Storti



Advisor: Prof. Claudia Diamantini
Curriculum supervisor: Prof. Sauro Longhi


                                     28 February 2012
Summary

   I. Introduction
           •   Background & Motivation
           •   Related Work
           •   Research Question
   II. Approach
   III. Knowledge Layer
   IV. Platform
           •   Service Discovery
           •   Process Composition
           •   Collaboration features
           •   Formal Verification of processes
   V. Conclusion

10/12/12       Emanuele Storti   KDD Process Design in Collaborative and Distributed Environments
Introduction

   “Knowledge is the common wealth of humanity”*
   Access to vast collections of data is the key to
   understanding and responding to complex problems


   Growing capability in data production (the “Data Deluge”):
   • availability of cost-effective technologies
   • enhancement of communication infrastructures

   Not only organizations, but also e-Science
   • global problems call for data-intensive computation (genomics, physics, climatology)
   • global effort through world-wide scientific collaboration

    Need for tools that support distribution and collaboration in data analysis


 * A. Samassekou, UN World Summit on the Information Society
Background

   Knowledge Discovery in Databases (KDD)
   Process of identifying valid, novel, potentially useful patterns in data
   [Fayyad, 1996]


                                                  • knowledge refinement
                                                  • many steps, several iterations
                                                  • strong user interaction
                                                  • need for user knowledge



                                                  Support to:
                                                  • decision making in organizations
                                                  • e-Science experimentations

Related Work

   Early proposals (Intelligent Data Analysis systems):
   • local frameworks
   • single-user
   • predefined set of tools (little extensibility)


   Distribution of tools & computational aspects:
   • SOA [Ali, 2005] [Kumar, 2005]
   • Grid [Kgrid, 2003]
   • OGSA: SOA + Grid


   Distribution of users & evolution of organizations:
   • user support in advanced activities [Bernstein, 2005] [Žáková, 2011]
   • collaboration [myExperiment, 2009]




Motivation

   Main issues in open, distributed and collaborative scenarios

   • Localization of many available distributed tools

   • Integration of heterogeneous tools
   (interfaces, programming languages, OSs, transfer protocols,…)

   • Complexity in their usage
   (data preparation, I/O interpretation, precondition satisfaction,
     process design), many possible combinations

   • Coordination in cooperative work (source of complexity)



Research Question

   How to support a community of users in the design
   of a KDD project in an open, heterogeneous,
   distributed and collaborative scenario?

      1. Which kind of support and functionalities should a platform offer?
      2. Which principles should the platform be based on?
      3. Which resources are involved in a KDD process and how to represent them?




Functional Requirements

   1. Support functionalities that should be offered by the
   platform:

    • importing new heterogeneous tools for several KDD tasks
    • retrieving tools useful for certain purposes
    • designing a process by connecting tools together
           • understanding whether such connections are meaningful
           • suggesting tool sequences that are effective for a given goal
    • executing the process
    • supporting collaboration throughout the design process
           • co-operative design
           • communication among team members



Non-functional Requirements

   2. Principles on which the platform should be based


      Main issues
      • heterogeneity
      • complexity
      • distribution
      • coordination

      Requirements
      • Interoperability
      • Flexibility
      • Modularity
      • Reusability
      • Transparency
      • Usability




Resources in a KDD project

   3. Identification of resources involved in a KDD process




   • Computational units (gathering, preparation, modeling, visualization, …)
   • Data and Models (dataset, input parameters, intermediate results, final model)
   • Actors (domain experts, DB/DWH administrators, DM and KDD experts)
   • Computational processes

Knowledge- and user-centric approach

    Service-Oriented platform

    Basic Services
    Services for every KDD phase.
    • tools are wrapped as services,
    • deployed on a server,
    • published in a common repository

    Support Services
    • back-end functionalities
    • advanced functionalities

    Collaborative functionalities
    • co-operation
    • communication facilities

    [Figure: a tool running as a service]




    Knowledge Layer
    Data model: systematization of knowledge regarding each KDD resource, from
    both a theoretical and an operational perspective

Knowledge Layer

   Methodology

   • degrees of abstraction for description of resources
   • semantic technologies for their representation
           • formal ontology building methodology [Fernandez, 1997] [Staab, 2001]
           • quality requirements [Gruber, 1995]



     Conceptual: abstract representation (domain ontologies)
     Concrete: actual resources of the platform (semantically annotated XML descriptors)
     Execution: logs and traces



Knowledge Layer > Computational units
                                     [IDA 2009] [ECML 2009]

   KDDONTO
   • algorithms arranged in a taxonomy
   • method, task, KDD phase, performances
   • data arranged in a taxonomy
   • algorithms’ interfaces (I/O, pre/post-conditions)
   • …
   Implemented in OWL-DL (ALCOIF expressivity)

   [Figure: KDDONTO concepts (Algorithm, Method, Task, Phase, Data, Performance Index); abstraction chain: algorithm, implemented as tool, running as service]

Knowledge Layer > Computational units

   SAWSDL: Semantic Annotations for WSDL (W3C)
       • location (<xmls:impl>)
       • I/O <message>: type, syntax
   Further extensions:
       • implemented <algorithm>
       • <performance>
       • <author>
       • …

   Descriptors are stored in a UDDI registry

   [Figure: UDDI service entry (Algorithm: alg, WSDL: url, QoS: val) linked to an eSAWSDL descriptor with <location>, <I/O>, <algorithm>, <performance>, <author>; abstraction chain: algorithm, implemented as tool, running as service]


Knowledge Layer > Computational units

   Mappings between abstraction levels
   • KDDONTO provides a shared vocabulary which services refer to
   • such mappings can support process composition


   [Figure: mapping example. The <output> of descriptor eSAWSDL1 (Remove missing values algorithm) and the <input> of descriptor eSAWSDL2 (C4.5 algorithm) both refer to the KDDONTO concept Labeled Dataset]


Knowledge Layer > Actors

   TeamONTO is a formal ontology for representing details about
   actors [CTS 2012]

   [Figure: TeamONTO diagram. Concepts: Person, Algorithm, WebService, Domain, Publication, Organization, Project. Relations: hasSkill, writes, worksIn, memberOf, about]




Knowledge Layer > Processes

   XML-descriptor for concrete level processes
     •     <services> included in the process
     •     <connection> among their I/O interfaces
     •     <users> in charge of setting service parameters
     •     metadata: creation <date>/<time>, <author>, <comments>
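
A minimal sketch of how such a descriptor could be assembled, assuming Python's standard `xml.etree.ElementTree`. Only the element names `<services>`, `<connection>`, `<users>`, `<date>`, `<time>`, `<author>` and `<comments>` come from the slide; the `process` root, the `service` and `user` children, the attribute names and all sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_process_descriptor(services, connections, users, author, date, time, comments=""):
    """Build a process descriptor; nesting and attribute names are assumptions."""
    root = ET.Element("process")  # hypothetical root element
    services_el = ET.SubElement(root, "services")
    for name in services:
        ET.SubElement(services_el, "service").text = name
    # One <connection> per link between the output of one service and the input of another.
    for source, target in connections:
        ET.SubElement(root, "connection", {"from": source, "to": target})
    # Users in charge of setting the parameters of a given service.
    users_el = ET.SubElement(root, "users")
    for user, service in users.items():
        ET.SubElement(users_el, "user", {"name": user, "sets": service})
    # Metadata elements named on the slide.
    for tag, value in (("date", date), ("time", time), ("author", author), ("comments", comments)):
        ET.SubElement(root, tag).text = value
    return root


descriptor = build_process_descriptor(
    services=["RemoveMissingValues", "C4.5"],
    connections=[("RemoveMissingValues", "C4.5")],
    users={"alice": "C4.5"},
    author="alice",
    date="2012-02-28",
    time="10:00",
    comments="first draft",
)
print(ET.tostring(descriptor, encoding="unicode"))
```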


   Process descriptors are stored in a Process Repository

   [Figure: a process connecting four services, each described by an eSAWSDL descriptor (eSAWSDL 1 to 4, with <location>, <I/O>, <algorithm>, <performance>, <author>), stored as a project in the Process Repository (ProcRep) together with its index]




Knowledge Layer > Processes

   Process Repository indexing method [SEBD 2011][SAC 2012]
   Application of graph-based hierarchical clustering [Jonyer, 2002]
   • extracts the most frequent and common subprocesses
   • arranges them in a lattice




   [Figure: hierarchical clustering applied to the Process Repository (ProcRep) to build the index]


   Support for process retrieval: subgraph isomorphism on the index is
   more efficient than on the whole repository
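
The retrieval idea can be illustrated with a deliberately simplified sketch. It assumes every service label occurs at most once per process, so that subgraph matching reduces to checking containment of labeled edge sets; the general case needs a real subgraph-isomorphism test. The repository contents, the index entries and the function name are hypothetical, and the index is written by hand rather than produced by clustering.

```python
# Each process is a set of directed edges between service labels (hypothetical data).
REPOSITORY = {
    "p1": {("RemoveMissingValues", "C4.5")},
    "p2": {("RemoveMissingValues", "C4.5"), ("C4.5", "TreeViewer")},
    "p3": {("Normalize", "KMeans")},
}

# Hypothetical index: frequent subprocess -> ids of the processes that contain it.
INDEX = {
    frozenset({("RemoveMissingValues", "C4.5")}): {"p1", "p2"},
    frozenset({("Normalize", "KMeans")}): {"p3"},
}


def retrieve(query):
    """Return ids of processes containing every edge of the query."""
    query = set(query)
    candidates = set(REPOSITORY)
    # If an indexed subprocess is part of the query, only the processes
    # listed for that subprocess can contain the whole query.
    for pattern, process_ids in INDEX.items():
        if pattern <= query:
            candidates &= process_ids
    # Full containment check, run on the reduced candidate set only.
    return sorted(pid for pid in candidates if query <= REPOSITORY[pid])


print(retrieve({("RemoveMissingValues", "C4.5"), ("C4.5", "TreeViewer")}))
```

In this sketch the index only prunes candidates; the final check still compares the query with each surviving process.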



   [Figure: graph matching of the user query against the index of the Process Repository (ProcRep)]


Knowledge Layer > Overview

   [Figure: overview of the Knowledge Layer. Ontologies: KDDONTO (Algorithm, Performance, Data) and TeamONTO (Algorithm, Person, WebService, Project). Concrete resources: UDDI service entries (Algorithm: alg, WSDL: url, QoS: val), eSAWSDL descriptors, and processes stored as projects in the Process Repository (ProcRep) with its index]




  Advantages: loose-coupling, modularity, reusability, support to advanced functions

KDDVM Platform
  Service-oriented platform for KDD process design
  [CTS 2010][CTS 2011][ISF]


   [Figure: KDDVM platform architecture, showing collaborative functionalities and support services]




KDDVM Platform > KDDDesigner

   • integration point of other support services
   • platform front-end




Service Discovery

   Retrieval of KDD services satisfying user requirements
    •      syntactic service discovery (service name)
    •      semantic service discovery (functionalities, goal, interface,…)
            1. Search, in KDDONTO, for algorithms satisfying the requirements
            2. Search, inside UDDI, for services implementing such algorithms
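
The two-step semantic discovery can be sketched as follows. This is only an illustration under stated assumptions: plain dictionaries stand in for KDDONTO and for the UDDI registry, and every algorithm description, service name, URL and function name is hypothetical.

```python
# Hypothetical stand-in for KDDONTO: algorithm -> described properties.
ALGORITHMS = {
    "C4.5": {"task": "classification", "input": "LabeledDataset"},
    "NaiveBayes": {"task": "classification", "input": "LabeledDataset"},
    "KMeans": {"task": "clustering", "input": "UnlabeledDataset"},
}

# Hypothetical stand-in for the UDDI registry: one entry per published service.
REGISTRY = [
    {"service": "c45-service", "algorithm": "C4.5", "wsdl": "http://example.org/c45?wsdl"},
    {"service": "kmeans-service", "algorithm": "KMeans", "wsdl": "http://example.org/kmeans?wsdl"},
]


def find_algorithms(task, input_type):
    """Step 1: select the algorithms whose description meets the requirements."""
    return sorted(
        name
        for name, description in ALGORITHMS.items()
        if description["task"] == task and description["input"] == input_type
    )


def find_services(algorithms):
    """Step 2: select the registered services implementing those algorithms."""
    wanted = set(algorithms)
    return [entry["service"] for entry in REGISTRY if entry["algorithm"] in wanted]


def discover(task, input_type):
    """Chain the two steps: ontology lookup first, registry lookup second."""
    return find_services(find_algorithms(task, input_type))


print(discover("classification", "LabeledDataset"))
```

Keeping the two lookups separate mirrors the split between the conceptual level (algorithms) and the concrete level (services): an algorithm may satisfy the requirements even when no service implements it yet.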

   [Figure: KDDONTO (Algorithm, Data, Goal) and a UDDI service entry (Algorithm: alg, WSDL: url, QoS: val) with its eSAWSDL descriptor (<location>, <I/O>, <algorithm>, <performance>, <author>)]


Process Composition

   Verification of I/O data compatibility




Process Composition: matchmaking

   Checking the validity of the match [IDA 2009] [ITAIS 2009] [ECAI 2010]
   • syntactic compatibility: comparison between service descriptors




   [Figure: descriptors eSAWSDL1 and eSAWSDL2 compared directly: same format? same syntax?]


   • semantic compatibility: comparison between ontological annotations of the
   services (kind of match between I/O, preconditions/postconditions...)
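
A minimal sketch of a semantic match evaluation, assuming three kinds of match (same concept, subconcept, part-of) ranked by an increasing cost. The taxonomy fragment, the part-of facts, the cost values and the direction chosen for the part-of test are illustrative assumptions, not the function defined in the cited papers; preconditions are not modeled.

```python
# Hypothetical fragment of a data taxonomy: concept -> parent concept (is-a).
IS_A = {
    "LabeledDataset": "Dataset",
    "UnlabeledDataset": "Dataset",
}

# Hypothetical part-of facts: part -> whole.
PART_OF = {
    "Codebook": "VectorQuantizerModel",
}


def ancestors(concept):
    """Yield every proper ancestor of a concept in the is-a taxonomy."""
    while concept in IS_A:
        concept = IS_A[concept]
        yield concept


def match_cost(produced, required):
    """Cost of feeding `produced` data to an input that asks for `required`.

    Returns None when no match is found. The cost values are assumptions.
    """
    if produced == required:
        return 0.0  # same concept
    if required in ancestors(produced):
        return 0.5  # produced datum is a subconcept of the required one
    if PART_OF.get(produced) == required:
        return 1.0  # produced datum is declared part of the required one
    return None


print(match_cost("LabeledDataset", "Dataset"))
```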


   Match evaluation function:
   • kind of match between data1 and data2 in KDDONTO (same concept? subconcept? part-of concept?)
   • preconditions
   • etc.
   Output: match cost

   [Figure: I/O data of descriptors eSAWSDL1 and eSAWSDL2 mapped to the KDDONTO concepts data1 and data2]


Process Composition: matchmaking

   Usage of Matchmaking:
     • provide the user with information about the validity of the match
     • find all services compatible with a given one
     • support an advanced functionality for semi-automatic process
      composition




Process Composition: semi-automatic procedure

   Procedure to generate conceptual processes [IDA 2009] [ITAIS 2009] [ECAI 2010]
   • planning technique: goal-driven, backwards strategy
   • basic step based on matchmaking
   • pruning criteria, stop criteria, user constraints
   • results: abstract processes useful as templates, ranked according to a metric
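
The goal-driven, backward strategy can be sketched as a small recursive search. This is a simplification under stated assumptions: exact type equality replaces the matchmaking step, a depth bound acts as the stop criterion, and plan length is the ranking metric. The interface table and the type names `RawDataset`, `DecisionTree` and `BayesModel` are hypothetical.

```python
# Hypothetical algorithm interfaces: name -> (input type, output type).
INTERFACES = {
    "RemoveMissingValues": ("RawDataset", "LabeledDataset"),
    "C4.5": ("LabeledDataset", "DecisionTree"),
    "NaiveBayes": ("LabeledDataset", "BayesModel"),
}


def compose_backwards(goal, available, max_depth=4):
    """Return algorithm sequences (in execution order) turning `available` into `goal`."""
    if goal == available:
        return [[]]  # nothing left to plan
    if max_depth == 0:
        return []  # stop criterion: give up on overly long chains
    plans = []
    for name, (needs, gives) in INTERFACES.items():
        if gives == goal:  # basic step: exact type match stands in for matchmaking
            # Recurse on the input this algorithm needs, moving backwards from the goal.
            for prefix in compose_backwards(needs, available, max_depth - 1):
                plans.append(prefix + [name])
    # Rank the abstract processes: shorter sequences first (assumed metric).
    return sorted(plans, key=len)


print(compose_backwards("DecisionTree", "RawDataset"))
```

User constraints and pruning criteria would enter as extra filters inside the loop; they are left out to keep the sketch short.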




Collaborative Features

   Collaborative features for process design within a team                                         [CTS 2011]

     1. Team building: to find users with certain skills & build a team
     2. Multi-user Process design and Versioning
     3. Communication
     4. Task assignment


   Retrieval of users with certain competencies, inside TeamONTO
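
A minimal sketch of competency-based retrieval, assuming TeamONTO facts are available as (subject, property, object) triples. Only the property names `hasSkill` and `memberOf` come from the ontology diagram; the people, skills, project and function name are hypothetical.

```python
# Hypothetical TeamONTO-style facts as (subject, property, object) triples.
FACTS = [
    ("alice", "hasSkill", "C4.5"),
    ("alice", "hasSkill", "Genomics"),
    ("bob", "hasSkill", "KMeans"),
    ("bob", "memberOf", "ProjectX"),
    ("carol", "hasSkill", "C4.5"),
]


def users_with_skills(required):
    """Return the users holding every skill in `required`, sorted by name."""
    skills = {}
    for subject, prop, obj in FACTS:
        if prop == "hasSkill":
            skills.setdefault(subject, set()).add(obj)
    needed = set(required)
    return sorted(user for user, owned in skills.items() if needed <= owned)


print(users_with_skills(["C4.5"]))
```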




   Asynchronous edit of a process, and its storage as a new version in the repository




   Talk page & comments attached to versions




   Assignment of the parameter setting for a service execution to a team member




Formal Verification of Processes

   1. Process representation through REO (CWI, Amsterdam) [Arbab, 2004]
            • language for representation of interaction among services
            • formal and compositional semantics
   2. Specification of the desired behavior through mCRL2 [Groote, 2006]
   3. Model-checking to verify whether the process satisfies the behavior




   Purposes [SEBD 2010]
   • formal verification of KDD processes at design-time
   • reuse of typical pre-verified KDD subprocesses




Conclusion

   KDD process design in distributed and collaborative environments

    • Distributed and heterogeneous settings:
           • systematic use of semantic information all through the process
           • loosely-coupled service architecture

    • Community-centered approach:
           • support functionalities
           • users with various degrees of experience

    • General-purpose attitude:
           • domain-independence
           • generalization to experimental processes in e-Science




References
   [Fayyad, 1996] “From data mining to knowledge discovery: an overview”, American
     Association for Artificial Intelligence.
   [Ali, 2005] “Web services composition for distributed data mining”, Parallel Processing
   [Kgrid, 2003] “The knowledge grid”, Comm. ACM
   [Bernstein, 2005] “Towards Intelligent Assistance for a Data Mining Process: An
     Ontology Based Approach for Cost-Sensitive Classification”, IEEE TKDE
   [Žáková, 2011] “Automating Knowledge Discovery Workflow Composition Through
     Ontology-Based Planning”, IEEE T. Automation Science and Engineering
   [myExperiment, 2009] “The Design and Realisation of the myExperiment Virtual Research
     Environment for Social Sharing of Workflows”, FGCS
   [Fernandez, 1997] “METHONTOLOGY: from Ontological Art towards Ontological
     Engineering”, Proc. of the AAAI97 Spring Symposium.
   [Staab, 2001] “Knowledge Processes and Ontologies”, IEEE Intelligent Systems.
   [Gruber, 1995] “Toward principles for the design of ontologies used for knowledge
     sharing”, Int. J. Hum.-Comput. Stud.
   [Jonyer, 2002] “Graph-based hierarchical conceptual clustering”, J. Mach. Learn. Res.
   [Groote, 2006] “The formal specification language mCRL2”, in MMOSS.
   [Arbab, 2004] “Reo: a channel-based coordination model for component composition”,
     Mathematical Structures in Comp. Sci.



Publications
   [IDA 2009] “Ontology-driven KDD Process Composition”. LNCS, Springer, 2009.
   [ECML 2009] “KDDONTO: an Ontology for Discovery and Composition of KDD
     Algorithms”. ECML/PKDD09 Workshop.
   [ITAIS 2009] “Automatic Definition of KDD Prototype Process by Composition”.
     “Management of Interconnected World”, Springer.
   [CTS 2010] “Semantic-Driven Design and Management of KDD Processes”. CTS 2010,
     IEEE.
   [SEBD 2010] “Towards Coordination Patterns for Complex Experimentations in Data
     Mining”. SEBD 2010.
   [ECAI 2010] “Supporting Users in KDD Process Design. A Semantic Similarity Matching
     Approach”. Planning To Learn Workshop in ECAI2010.
   [CTS 2011] “Semantic-Aided Designer for Knowledge Discovery”. CTS 2011, IEEE.
   [SEBD 2011] “Clustering of process schema by graph mining techniques”. SEBD 2011.
   [SAC 2012] “Mining Usage Patterns from a Repository of Scientific Workflows”.
     SAC2012, ACM.
   [CTS 2012] “Semantically-supported Team Building in a KDD Virtual Environment”. CTS
     2012, IEEE.
   [ISF] “A Virtual Mart for Knowledge Discovery in Databases”. Information Systems
     Frontiers, Springer, accepted.



 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 

Kürzlich hochgeladen (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 

Empfohlen

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Empfohlen (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

KDD Process Design in Collaborative and Distributed Environments

  • 1. Università Politecnica delle Marche, Doctoral School in Engineering Sciences, Curriculum in Computer, Management and Automation Engineering. KDD Process Design in Collaborative and Distributed Environments. Emanuele Storti. Advisor: Prof. Claudia Diamantini. Curriculum supervisor: Prof. Sauro Longhi. 28 February 2012
  • 2. Summary I. Introduction • Background & Motivation • Related Work • Research Question II. Approach III. Knowledge Layer IV. Platform • Service Discovery • Process Composition • Collaboration features • Formal Verification of processes V. Conclusion 10/12/12 Emanuele Storti KDD Process Design in Collaborative and Distributed Environments
  • 3. Introduction “Knowledge is the common wealth of humanity”* Access to vast collections of data is the key to understanding and responding to complex problems. Growing capability in data production (Data Deluge): • availability of cost-effective technologies • enhancement of communication infrastructures. Not only organizations, but also e-Science: • global problems call for data-intensive computation (genomics, physics, climatology) • global effort through worldwide scientific collaboration. Need for tools supporting distribution and collaboration in data analysis. * A. Samassekou, UN World Summit on the Information Society
  • 4. Background Knowledge Discovery in Databases (KDD): process of identifying valid, novel, potentially useful patterns in data [Fayyad, 1996] • knowledge refinement • many steps, several iterations • strong user interaction • need for user knowledge. Support for: • decision making in organizations • e-Science experimentation
  • 5. Related Work Early proposals (Intelligent Data Analysis systems): • local frameworks • single-user • predefined set of tools (little extensibility) Distribution of tools & computational aspects: • SOA [Ali, 2005] [Kumar, 2005] • Grid [Kgrid, 2003] • OGSA: SOA + Grid Distribution of users & evolution of organizations: • user support in advanced activities [Bernstein, 2005] [Žáková, 2011] • collaboration [myExperiment, 2009]
  • 6. Motivation Main issues in open, distributed and collaborative scenarios • Localization of many available distributed tools • Integration of heterogeneous tools (interfaces, programming languages, OSs, transfer protocols,…) • Complexity in their usage (data preparation, I/O interpretation, precondition satisfaction, process design), many possible combinations • Coordination in cooperative work (source of complexity)
  • 7. Research Question How to support a community of users in the design of a KDD project in an open, heterogeneous, distributed and collaborative scenario? 1. Which kind of support and functionalities should a platform offer? 2. Which principles should the platform be based on? 3. Which resources are involved in a KDD process and how to represent them?
  • 8. Functional Requirements 1. Support functionalities that should be offered by the platform: • importing new heterogeneous tools for several KDD tasks • retrieving tools useful for certain purposes • designing a process by connecting tools together • understanding whether such connections are meaningful • suggesting tool sequences that are effective for a given goal • executing the process • supporting collaboration throughout the design process • co-operative design • communication among team members
  • 9. Non-functional Requirements 2. Principles on which the platform should be based. Requirements: • Interoperability • Flexibility • Modularity • Reusability • Transparency • Usability. Main issues: • heterogeneity • complexity • distribution • coordination
  • 10. Resources in a KDD project 3. Identification of resources involved in a KDD process • Computational units (gathering, preparation, modeling, visualization, …) • Data and Models (dataset, input parameters, intermediate results, final model) • Actors (domain experts, DB/DWH administrators, DM and KDD experts) • Computational processes
  • 11. Knowledge- and user-centric approach. Service-Oriented platform. Basic Services: services for every KDD phase • tools are wrapped as services, • deployed on a server, • published in a common repository. Support Services: • back-end functionalities • advanced functionalities. Collaborative functionalities: • co-operation • communication facilities. Knowledge Layer: data model, systematization of knowledge regarding each KDD resource, from both a theoretical and an operational perspective. [Diagram: tool running as service]
  • 12. Knowledge Layer Methodology • degrees of abstraction for description of resources • semantic technologies for their representation • formal ontology building methodology [Fernandez, 1997] [Staab, 2001] • quality requirements [Gruber, 1995]. Levels: Conceptual: abstract representation (domain ontologies). Concrete: actual resources of the platform (semantically annotated XML descriptors). Execution: logs and traces
  • 13. Knowledge Layer > Computational units [IDA 2009] [ECML 2009] KDDONTO • algorithms arranged in a taxonomy • method, task, KDD phase, performances • data arranged in a taxonomy • algorithms’ interfaces (I/O, pre/post-conditions) • … Implemented in OWL-DL (ALCOIF expressivity). [Diagram: KDDONTO concepts Algorithm, Method, Task, Phase, Data, Performance Index; algorithm implemented as tool, tool running as service]
  • 14. Knowledge Layer > Computational units. SAWSDL: Semantic Annotations for WSDL (W3C) • location (<xmls:impl>) • I/O <message>: type, syntax. Further extensions: • implemented <algorithm> • <performance> • <author> • … Descriptors are stored in a UDDI registry. [Diagram: eSAWSDL descriptor with <location>, <I/O>, <algorithm>, <performance>, <author>; UDDI SERVICE entry with Algorithm, WSDL url, QoS; algorithm implemented as tool, tool running as service]
  • 15. Knowledge Layer > Computational units. Mappings between abstraction levels • KDDONTO provides a shared vocabulary which services refer to • such mappings can support process composition. [Diagram: KDDONTO concepts Remove missing values (algorithm), Labeled Dataset, C4.5 (algorithm), mapped to the <output>, <input> and <algorithm> elements of descriptors eSAWSDL1 and eSAWSDL2]
  • 16. Knowledge Layer > Actors. TeamONTO is a formal ontology devoted to representing details about actors [CTS 2012]. [Diagram: TeamONTO concepts Person, Algorithm, WebService, Domain, Publication, Organization, Project; relations hasSkill, writes, worksIn, memberOf, about]
  • 17. Knowledge Layer > Processes. XML descriptor for concrete-level processes • <services> included in the process • <connection> among their I/O interfaces • <users> in charge of setting service parameters • metadata: creation <date>/<time>, <author>, <comments>. Process descriptors are stored in a Process Repository. [Diagram: a project connecting the services described by eSAWSDL 1 to eSAWSDL 4, stored in ProcRep with its index]
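A minimal sketch of how the descriptor elements listed on this slide could be assembled, using Python's standard `xml.etree.ElementTree`. Only the tag names `<services>`, `<connection>`, `<users>`, `<date>`, `<time>`, `<author>` and `<comments>` come from the slide; the root tag, the nesting, the attribute names and the service names are assumptions made for illustration and are not the platform's actual schema.

```python
import xml.etree.ElementTree as ET


def build_process_descriptor(services, connections, users,
                             author, date, time, comments=""):
    """Build a process descriptor as an ElementTree element.

    The root tag <process>, the <metadata> wrapper and all attribute
    names are hypothetical; the slide only names the main elements.
    """
    root = ET.Element("process")

    # <services>: the services included in the process
    services_el = ET.SubElement(root, "services")
    for name in services:
        ET.SubElement(services_el, "service", name=name)

    # <connection>: one element per link between an output and an input
    for src, output, dst, input_ in connections:
        ET.SubElement(root, "connection",
                      {"from": src, "output": output,
                       "to": dst, "input": input_})

    # <users>: who is in charge of setting the parameters of which service
    users_el = ET.SubElement(root, "users")
    for user, service in users.items():
        ET.SubElement(users_el, "user", name=user, service=service)

    # metadata: creation date/time, author, comments
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "date").text = date
    ET.SubElement(meta, "time").text = time
    ET.SubElement(meta, "author").text = author
    ET.SubElement(meta, "comments").text = comments
    return root


descriptor = build_process_descriptor(
    services=["RemoveMissingValuesService", "C45Service"],
    connections=[("RemoveMissingValuesService", "dataset",
                  "C45Service", "training_set")],
    users={"alice": "C45Service"},
    author="alice", date="2012-02-28", time="10:00",
    comments="first draft",
)
print(ET.tostring(descriptor, encoding="unicode"))
```

Keeping connections as separate elements that refer to services by name mirrors the loose coupling the slides emphasize: a service descriptor can be reused in several processes without being copied.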
  • 18. Knowledge Layer > Processes. Process Repository indexing method [SEBD 2011][SAC 2012]. Application of graph-based hierarchical clustering [Jonyer, 2002] • extracts the most frequent and common subprocesses • arranges them in a lattice. [Diagram: hierarchical clustering turns ProcRep into an index]
  • 19. Knowledge Layer > Processes. Process Repository indexing method [SEBD 2011][SAC 2012]. Application of graph-based hierarchical clustering [Jonyer, 2002] • extracts the most frequent and common subprocesses • arranges them in a lattice. Support for process retrieval: subgraph isomorphism on the index is more efficient than on the whole repository. [Diagram: a user query is matched against the ProcRep index by graph matching]
  • 20. Knowledge Layer > Overview. [Diagram: KDDONTO (Algorithm, Performance, Data) and TeamONTO (Algorithm, Person, WebService, Project) at the conceptual level; eSAWSDL descriptors, UDDI SERVICE entries (Algorithm, WSDL url, QoS), project and process descriptors with the ProcRep index at the concrete level] Advantages: loose coupling, modularity, reusability, support for advanced functions
  • 21. KDDVM Platform. Service-oriented platform for KDD process design [CTS 2010][CTS 2011][IFS]. [Diagram labels: collaborative functionalities, support services]
  • 22. KDDVM Platform > KDDDesigner • integration point of other support services • platform front-end
  • 23. Service Discovery. Retrieval of KDD services satisfying user requirements • syntactic service discovery (service name) • semantic service discovery (functionalities, goal, interface,…) 1. Search, in KDDONTO, for algorithms satisfying the requirements 2. Search, inside UDDI, for services implementing such algorithms. [Diagram: KDDONTO concepts Algorithm, Data, Goal; eSAWSDL descriptor; UDDI SERVICE entry with Algorithm, WSDL url, QoS]
  • 24. Service Discovery. Retrieval of KDD services satisfying user requirements • syntactic service discovery (service name) • semantic service discovery (functionalities, goal, interface,…) 1. Search, in KDDONTO, for algorithms satisfying the requirements 2. Search, inside UDDI, for services implementing such algorithms
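The two-step semantic discovery described above can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: a plain dictionary stands in for KDDONTO, a list of records stands in for the UDDI registry, and all algorithm properties, service names and URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    algorithm: str   # algorithm the service implements
    wsdl_url: str    # hypothetical location of the descriptor


# Toy stand-in for KDDONTO: algorithm -> properties (assumed vocabulary)
ONTOLOGY = {
    "C4.5": {"task": "Classification", "input": "LabeledDataset"},
    "NaiveBayes": {"task": "Classification", "input": "LabeledDataset"},
    "KMeans": {"task": "Clustering", "input": "UnlabeledDataset"},
}

# Toy stand-in for the UDDI registry
REGISTRY = [
    Service("J48Service", "C4.5", "http://example.org/j48?wsdl"),
    Service("KMeansService", "KMeans", "http://example.org/kmeans?wsdl"),
    Service("NBService", "NaiveBayes", "http://example.org/nb?wsdl"),
]


def find_algorithms(ontology, **requirements):
    """Step 1: algorithms whose properties satisfy every requirement."""
    return sorted(
        name for name, props in ontology.items()
        if all(props.get(key) == value for key, value in requirements.items())
    )


def find_services(registry, algorithms):
    """Step 2: services implementing one of the selected algorithms."""
    wanted = set(algorithms)
    return [service for service in registry if service.algorithm in wanted]


def discover(ontology, registry, **requirements):
    """Semantic discovery: ontology lookup first, registry lookup second."""
    return find_services(registry, find_algorithms(ontology, **requirements))


for service in discover(ONTOLOGY, REGISTRY, task="Classification"):
    print(service.name, service.wsdl_url)
```

Separating the two steps means the user states requirements at the conceptual level only; which concrete services exist is resolved afterwards, so new services can be published without changing the ontology.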
  • 25. Process Composition. Verification of I/O data compatibility
  • 26. Process Composition: matchmaking. Checking the validity of the match [IDA 2009] [ITAIS 2009] [ECAI 2010] • syntactic compatibility: comparison between service descriptors (same format? same syntax?). [Diagram: the I/O elements of descriptors eSAWSDL1 and eSAWSDL2 are compared]
  • 27. Process Composition: matchmaking. Checking the validity of the match [IDA 2009] [ITAIS 2009] [ECAI 2010] • syntactic compatibility: comparison between service descriptors • semantic compatibility: comparison between ontological annotations of the services (kind of match between I/O, preconditions/postconditions...). Match evaluation function: • kind of match • preconditions • etc… Output: match cost. [Diagram: KDDONTO concepts data1 and data2, annotating the I/O of eSAWSDL1 and eSAWSDL2, are compared: same concept? subconcept? part-of concept?]
  • 28. Process Composition: matchmaking. Usage of Matchmaking: • provide the user with information about the validity of the match • find all services compatible with a given one • support an advanced functionality for semi-automatic process composition
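A minimal sketch of the semantic part of matchmaking, covering the three kinds of match named on the slides (same concept, subconcept, part-of concept) and a match cost. The concept names, the two relation tables and the numeric costs are assumptions chosen for illustration; the slides do not give the actual match evaluation function, and preconditions are left out here.

```python
# Hypothetical fragment of a data taxonomy: concept -> direct superconcept
SUBCLASS_OF = {
    "LabeledDataset": "Dataset",
    "UnlabeledDataset": "Dataset",
}

# Hypothetical part-of relation: part -> whole
PART_OF = {
    "Centroids": "ClusteringModel",
}

# Assumed costs: the weaker the match, the higher the cost
COSTS = {"exact": 0.0, "subconcept": 0.5, "part-of": 1.0}


def is_subconcept(concept, ancestor, subclass_of):
    """True if `ancestor` is reachable from `concept` via subclass links."""
    current = subclass_of.get(concept)
    while current is not None:
        if current == ancestor:
            return True
        current = subclass_of.get(current)
    return False


def match(output, required_input, subclass_of=SUBCLASS_OF, part_of=PART_OF):
    """Return (kind, cost) if `output` can feed `required_input`, else None."""
    if output == required_input:
        return ("exact", COSTS["exact"])
    # The produced data is more specific than what is required
    if is_subconcept(output, required_input, subclass_of):
        return ("subconcept", COSTS["subconcept"])
    # The required data is a part of what is produced
    if part_of.get(required_input) == output:
        return ("part-of", COSTS["part-of"])
    return None


def compatible_services(output, candidates):
    """Find all candidate services whose input accepts `output`.

    `candidates` maps a service name to its required input concept.
    The result is sorted by increasing match cost.
    """
    found = []
    for name, required_input in candidates.items():
        result = match(output, required_input)
        if result is not None:
            found.append((name, result[0], result[1]))
    return sorted(found, key=lambda item: (item[2], item[0]))


print(compatible_services(
    "LabeledDataset",
    {"C45Service": "LabeledDataset",
     "StatsService": "Dataset",
     "KMeansService": "UnlabeledDataset"},
))
```

Returning a cost instead of a yes/no answer lets the same function serve both uses listed above: informing the user about the validity of a single match, and ranking all services compatible with a given one.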
  • 29. Process Composition: semi-automatic procedure. Procedure to generate conceptual processes [IDA 2009] [ITAIS 2009] [ECAI 2010] • planning technique: goal-driven, backwards strategy • basic step based on matchmaking • pruning criteria, stop criteria, user constraints • results: abstract processes useful as templates, ranked according to a metric
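The goal-driven, backwards strategy can be illustrated with a small sketch. It is a simplification under stated assumptions: each algorithm has a single input and a single output concept, the basic step uses exact matches only, the pruning criterion is "no algorithm repeated in a process", the stop criterion is a maximum length, and processes are ranked by length. The algorithm table and concept names are hypothetical, and the actual criteria and ranking metric are not given on the slide.

```python
# Hypothetical conceptual-level algorithms: name -> (input, output)
ALGORITHMS = {
    "RemoveMissingValues": ("DirtyLabeledDataset", "LabeledDataset"),
    "C4.5": ("LabeledDataset", "ClassificationModel"),
    "NaiveBayes": ("LabeledDataset", "ClassificationModel"),
    "KMeans": ("UnlabeledDataset", "ClusteringModel"),
}


def compose(algorithms, available, goal, max_len=4):
    """Enumerate algorithm sequences turning `available` into `goal`.

    The search starts from the goal and walks backwards: at each step it
    prepends an algorithm whose output is the concept currently needed.
    """
    results = []

    def extend(needed, suffix):
        # Stop criterion: the needed concept is the data we already have
        if needed == available and suffix:
            results.append(list(suffix))
            return
        # Stop criterion: maximum process length reached
        if len(suffix) >= max_len:
            return
        for name, (input_, output) in algorithms.items():
            # Basic step (exact match) plus pruning of repeated algorithms
            if output == needed and name not in suffix:
                extend(input_, [name] + suffix)

    extend(goal, [])
    # Assumed ranking metric: shorter processes first, then alphabetical
    return sorted(results, key=lambda process: (len(process), process))


for process in compose(ALGORITHMS, "DirtyLabeledDataset", "ClassificationModel"):
    print(" -> ".join(process))
```

The resulting sequences are abstract processes in the sense of the slide: they name algorithms, not services, so they can serve as templates to be bound to concrete services later.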
  • 30. Collaborative Features. Collaborative features for process design within a team [CTS 2011] 1. Team building: to find users with certain skills & build a team 2. Multi-user Process design and Versioning 3. Communication 4. Task assignment. Retrieval of users with certain competencies, inside TeamONTO
  • 31. Collaborative Features. Collaborative features for process design within a team [CTS 2011] 1. Team building: to find users with certain skills & build a team 2. Multi-user Process design and Versioning 3. Communication 4. Task assignment. Asynchronous editing of a process, and its storage as a new version in the repository
  • 32. Collaborative Features. Collaborative features for process design within a team [CTS 2011] 1. Team building: to find users with certain skills & build a team 2. Multi-user Process design and Versioning 3. Communication 4. Task assignment. Talk page & comments attached to versions
  • 33. Collaborative Features. Collaborative features for process design within a team [CTS 2011] 1. Team building: to find users with certain skills & build a team 2. Multi-user Process design and Versioning 3. Communication 4. Task assignment. Assigning the parameter setting for a service execution to a user of the team
  • 34. Formal Verification of Processes 1. Process representation through REO (CWI, Amsterdam) [Arbab, 2004] • language for representation of interaction among services • formal and compositional semantics 2. Specification of the desired behavior through mCRL2 [Groote, 2006] 3. Model-checking to verify whether the process satisfies the behavior
  • 35. Formal Verification of Processes 1. Process representation through REO (CWI, Amsterdam) [Arbab, 2004] • language for representation of interaction among services • formal and compositional semantics 2. Specification of the desired behavior through mCRL2 [Groote, 2006] 3. Model-checking to verify whether the process satisfies the behavior
  • 36. Formal Verification of Processes
        1. Process representation through Reo (CWI, Amsterdam) [Arbab, 2004]
              • language for representation of interaction among services
              • formal and compositional semantics
        2. Specification of the desired behavior through mCRL2 [Groote, 2006]
        3. Model-checking to verify whether the process satisfies the behavior
        Purposes [SEBD 2010]
              • formal verification of KDD processes at design-time
              • reuse of typical pre-verified KDD subprocesses
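The idea behind step 3 can be illustrated with a toy example. The sketch below uses neither Reo nor the mCRL2 toolset: it is a plain-Python stand-in that encodes a process as a labelled transition system and exhaustively explores its reachable states. The function `find_violation`, the state names and the actions `load`, `clean` and `classify` are assumptions made for illustration only. The property checked is a simple ordering constraint: one action must never occur before another has happened.

```python
from collections import deque


def find_violation(transitions, initial, required, guarded):
    """Search for a run in which `guarded` occurs before `required`.

    `transitions` maps a state to a list of (action, next_state) pairs.
    Returns the offending action sequence, or None if no reachable run
    violates the ordering.
    """
    start = (initial, False)      # (process state, has `required` happened?)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        (state, done), trace = queue.popleft()
        for action, target in transitions.get(state, []):
            if action == guarded and not done:
                return trace + [action]
            successor = (target, done or action == required)
            if successor not in visited:
                visited.add(successor)
                queue.append((successor, trace + [action]))
    return None


# Hypothetical process: load the data, clean it, then classify.
good = {
    "s0": [("load", "s1")],
    "s1": [("clean", "s2")],
    "s2": [("classify", "s3")],
}
# Faulty variant: an extra branch lets classification skip cleaning.
bad = {
    "s0": [("load", "s1")],
    "s1": [("clean", "s2"), ("classify", "s3")],
    "s2": [("classify", "s3")],
}

print(find_violation(good, "s0", "clean", "classify"))  # None
print(find_violation(bad, "s0", "clean", "classify"))   # ['load', 'classify']
```

A real model checker handles far richer properties and state spaces, but the design-time benefit is the same: the faulty branch is reported with a counterexample trace before any service is executed.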
  • 37. Conclusion
        KDD process design in distributed and collaborative environments
        • Distributed and heterogeneous settings:
              • systematic use of semantic information throughout the process
              • loosely coupled service architecture
        • Community-centered approach:
              • support functionalities
              • users with various degrees of experience
        • General-purpose attitude:
              • domain independence
              • generalization to experimental processes in e-Science
  • 38. References
        [Fayyad, 1996] “From data mining to knowledge discovery: an overview”, American Association for Artificial Intelligence.
        [Ali, 2005] “Web services composition for distributed data mining”, Parallel Processing.
        [Kgrid, 2003] “The knowledge grid”, Comm. ACM.
        [Bernstein, 2005] “Towards Intelligent Assistance for a Data Mining Process: An Ontology Based Approach for Cost-Sensitive Classification”, IEEE TKDE.
        [Žáková, 2011] “Automating Knowledge Discovery Workflow Composition Through Ontology-Based Planning”, IEEE T. Automation Science and Engineering.
        [myExperiment, 2006] “The Design and Realisation of the Virtual Research Environment for Social Sharing of Workflows”, FGCS.
        [Fernandez, 1997] “METHONTOLOGY: from Ontological Art towards Ontological Engineering”, Proc. of the AAAI97 Spring Symposium.
        [Staab, 2001] “Knowledge Processes and Ontologies”, IEEE Intelligent Systems.
        [Gruber, 1995] “Toward principles for the design of ontologies used for knowledge sharing”, Int. J. Hum.-Comput. Stud.
        [Jonyer, 2002] “Graph-based hierarchical conceptual clustering”, J. Mach. Learn. Res.
        [Groote, 2006] “The formal specification language mCRL2”, in MMOSS.
        [Arbab, 2004] “Reo: a channel-based coordination model for component composition”, Mathematical Structures in Comp. Sci.
  • 39. Publications
        [IDA 2009] “Ontology-driven KDD Process Composition”. LNCS, Springer, 2009.
        [ECML 2009] “KDDONTO: an Ontology for Discovery and Composition of KDD Algorithms”. ECML/PKDD09 Workshop.
        [ITAIS 2009] “Automatic Definition of KDD Prototype Process by Composition”. “Management of Interconnected World”, Springer.
        [CTS 2010] “Semantic-Driven Design and Management of KDD Processes”. CTS 2010, IEEE.
        [SEBD 2010] “Towards Coordination Patterns for Complex Experimentations in Data Mining”. SEBD 2010.
        [ECAI 2010] “Supporting Users in KDD Process Design. A Semantic Similarity Matching Approach”. Planning To Learn Workshop in ECAI2010.
        [CTS 2011] “Semantic-Aided Designer for Knowledge Discovery”. CTS 2011, IEEE.
        [SEBD 2011] “Clustering of process schema by graph mining techniques”. SEBD 2011.
        [SAC 2012] “Mining Usage Patterns from a Repository of Scientific Workflows”. SAC2012, ACM.
        [CTS 2012] “Semantically-supported Team Building in a KDD Virtual Environment”. CTS 2012, IEEE.
        [ISF] “A Virtual Mart for Knowledge Discovery in Databases”. Information Systems Frontiers, Springer, accepted.