Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

Anr cair meeting feb 2016

307 Aufrufe

Veröffentlicht am

Talk given at the plenary meeting of the CAIR ANR project in Lyon

Veröffentlicht in: Bildung
  • Als Erste(r) kommentieren

  • Gehören Sie zu den Ersten, denen das gefällt!

Anr cair meeting feb 2016

  1. 1. Khalid Belhajjame Joint work with Daniela Grigori, Mariem Harmassi et Manel BenYahia LAMSADE, Université Paris Dauphine Keyword-Based Search of Workflow Fragments and their Composition
  2. 2. Workflows A business process specified using the BPMN notation A ScientificWorkflow system (Taverna) A workflow consists of an orchestrated and repeatable pattern of business activity enabled by the systematic organization of resources into processes that transform materials, provide services, or process information (Workflow Coalition) CAIR, Lyon2 18 Feb 2016
  3. 3. Workflows are difficult to design —  The design of scientific workflows, just like business process, can be a difficult task —  Deep knowledge of the domain —  Awareness of the resources, e.g., programs and web services, that can enact the steps of the workflow —  Publish and share workflows, and promote their reuse. —  myExperiment, CrowldLab, Galaxy, and other various business process repository —  Reuse is still an aim. —  There are no capabilities that support the user in identifying the workflows, or fragments thereof, that are relevant for the task at hand. CAIR, Lyon4 18 Feb 2016
  4. 4. Assisting designers of workflows through reuse CAIR, Lyon5 —  There has been a good body of work in the literature on workflow discover. —  The typical approach requires the user to specify a query in the form of a workflow that is compared against a collection of workflows. —  In our work, we focus on a problem that is different and that received little attention. —  We position ourselves in a context where a designer is specifying his/her workflow and needs assistance to design parts of her workflow (which we term fragments). 18 Feb 2016
  5. 5. Approach CAIR, Lyon6 18 Feb 2016
  6. 6. Mining Frequent Workflow Fragments 18 Feb 2016CAIR, Lyon7 Given a repository of workflows, we would like to mine frequent workflow fragments, i.e., fragments that are used across multiple workflows.This raises two questions: —  How to deal with the heterogeneity of the labels used by different users to model the activities of their workflows within the repository? —  Which graph representation is best suited for formatting workflows for mining frequent fragments?
  7. 7. Homogenizing Activity Labels 18 Feb 2016CAIR, Lyon8 —  Activities implementing the same functionality should have the same names. —  Workflow authored by different people are likely to use different labels for naming activities that perform the same task. —  To homogenize activity labels we use state of the art techniques, which rely on a common dictionary for: —  Identifying the activities that perform the same task, and —  Selecting the same label for naming them.
  8. 8. Detecting Recurrent Fragments —  We use graph mining algorithms to identify the fragments in the repository that are recurrent. —  We use the SUBDUE algorithm. —  Which graph representation to use to represent (workflow) fragments? —  We examined a number of workflow representation CAIR, Lyon12 18 Feb 2016
  9. 9. Representation A att1 att2 att3 att4 att5 next operator And operator sequence next operand operator Xor type type operand next operand typeoperand operand Representation B att1 att2 att3 att4 att5 next Split-And next Join-Xor J-Xor sequence next sp-and sp-and CAIR, Lyon13 18 Feb 2016
  10. 10. Representation C att2 att3 att4 att5 att1 S-att1-att2 S-att1-att3 seq-att2-att4 seq-att4-att5 att2 att3 att5 att1 S-att1-att2 S-att1-att3 seq-att3-att5 CAIR, Lyon14 18 Feb 2016
  11. 11. att1 att2 att3 att4 att5 And_att1_att3 And_att1_att2 XOR_att3_att5 SEQ_att2_att4 XOR_att4_att5 Representation D Representation D1 att1 att2 att3 att4 att5 And And XOR SEQ XOR CAIR, Lyon15 18 Feb 2016
  12. 12. Experiment l Objective:To assess the suitability of the graph representations for mining workflow graphs l Effectiveness : Precision/ Recall l Size of the graphs l Execution time CAIR, Lyon16 18 Feb 2016
  13. 13. Dataset —  We created three datasets of workflow specifications, containing respectively 30, 42, and 71 workflows. —  9 out of these workflows are similar to each other and, as such contain recurrent structures, that should be detected by the mining algorithm. —  Despite the small size of the collection, these datasets allowed to distinguish to a certain extent between the different representations. CAIR, Lyon17 18 Feb 2016
  14. 14. Experiment results: Input Data size CAIR, Lyon18 18 Feb 2016
  15. 15. Experiment: Effectiveness (Precision/ Recall) CAIR, Lyon19 Precision = #Truepositives #Truepositives + #Falsepositives Recall = #Truepositives #Truepositives + #Falsenegatives 18 Feb 2016
  16. 16. Representation A att1 att2 att3 att4 att5 next operator And operator sequence next operand operator Xor type type operand next operand typeoperand operand Representation B att1 att2 att3 att4 att5 next Split-And next Join-Xor J-Xor sequence next sp-and sp-and CAIR, Lyon20 18 Feb 2016
  17. 17. Experiment1: Effectiveness (Precision/ Recall) CAIR, Lyon21 Precision = #Truepositives #Truepositives + #Falsepositives Recall = #Truepositives #Truepositives + #Falsenegatives 18 Feb 2016
  18. 18. Approach CAIR, Lyon24 18 Feb 2016
  19. 19. Keyword-Based Search for frequent workflow fragments 18 Feb 2016CAIR, Lyon25 Given an initial workflow, the user issues a query to characterize the desired missing workflow fragment.The query is composed of two elements: 1.  A set of keywords {kw1,…,kwn} characterizing the functionality that the fragment should implement 2.  The set of activities in the initial workflow that are to be connected to the fragment in question. Acommon = {a1,…,am},We call these activities common or joint activities, interchangeably.
  20. 20. Retrieving Relevant Fragments 18 Feb 2016CAIR, Lyon26 —  The first step in processing the user query consists in identifying the fragments in the repository that are relevant given the specified keywords. —  In doing so, we adopt the widely used technique ofTF/IDF (term frequency / inverse document frequency) measure. —  Note however that workflow fragments are indexed by the labels of the activities obtained by homogenization, which we refer to by IDX —  Whereas, the user uses keywords of his own choosing
  21. 21. Retrieving relevant Fragments 18 Feb 2016CAIR, Lyon27 —  For each keyword kwi provided by the user, we retrieve its synonyms, syn(kwi), from an existing thesaurus, e.g., Wordnet —  The index IDX is trawled to identify if there is a term in syn(kwi) U kwi, which is can be used to represent the keyword in the vector used for retrieving fragments. —  Once the fragments are retrieved, their constitent activities are matched against the common activities in the initial workflow that the user identified
  22. 22. Ranking Fragments 18 Feb 2016CAIR, Lyon28 The score used for ranking candidate fragments given a user query uq is defined as follows: —  Relevance: denotes the score obtained usingTF-IDF —  Frequency:A workflow fragment that is frequently used across multiple workflow is likely to implement best practices that are useful for the designer compared with a fragment that apperas in, say one workflow Compatibility(wffrg, uq.Acommon) = aj 2Acommon matchingScore(aj, wffrg) |Acommon| where matchingScore(aj, wffrg) is the matching score between aj and the best matching activity in wffrg, and |Acommon| the number of activities in Acommon. Compatibility(wffrg, Acommon) takes a value between 0 and 1. The larger is the number of activities Acommon that have a matching activity in wffrg and the higher are the matching scores, the higher is the compati- bility between the candidate workflow fragment and the initial workflow. Based on the above factors, we define the score used for ranking candidate fragments given a user query uq as: Score(wffrg, uq) = wr.Relvance(wffrg, uq) + wf . F requency(wffrg) MaxF requency + wc.Compatibility(wffrg, uq.Acommon) where wr, wf and wc are positive real numbers representing weights such that wr + wf + wc = 1. MaxFrequency is positive number denoting the frequency of the fragment that appears the maximum of times in the workflow repository harvested. Notice that the score takes a value between 0 and 1. Once the can- didate fragments are scored, they are ranked in the descendant order of their associated scores. The top-k fragments, e.g., 5, are then presented to the designer who selects to the one to be composed with the initial workflow. 5 Composing Workflow Fragments Once the user has examined the fragments that are retrieved given the set of keywords s/he specified, s/he can choose a fragment to be composed with
  23. 23. Ranking Fragment: Compatibility 18 Feb 2016CAIR, Lyon29 To estimate the compatibility, we consider the number of activities in Acommon that have a matching activity in the workflow fragment and their associated matching score. ated matching score. Specifically, we define the compatibility of a workflow fragment given the activities uq.Acommon specified in the user query uq as follows: Compatibility(wffrg, uq.Acommon) = P aj 2Acommon matchingScore(aj, wffrg) |Acommon| where matchingScore(aj, wffrg) is the matching score between aj and the best matching activity in wffrg, and |Acommon| the number of activities in Acommon. Compatibility(wffrg, Acommon) takes a value between 0 and 1. The larger is the number of activities Acommon that have a matching activity in wffrg and the higher are the matching scores, the higher is the compati- bility between the candidate workflow fragment and the initial workflow. Based on the above factors, we define the score used for ranking candidate fragments given a user query uq as: Score(wffrg, uq) = wr.Relvance(wffrg, uq) + wf . F requency(wffrg) MaxF requency + wc.Compatibility(wffrg, uq.Acommon)
  24. 24. Composing Workflow Fragments 18 Feb 2016CAIR, Lyon30 configurable split operator to connect the two workflows as illustrated in Figure 10 (lines 9 and 10). We call such an operator configurable because it is up to the user to choose if such an operator is of a type and split, or split or xor split. The preceding activity of such an operator is acommon and its succeeding activities are the succeeding activities of wfinitial and wffrg. Fig. 8: Merging the initial workflow and a fragment in sequence based on the common activity acommon. Fig. 9: Merging the preceding activities of the initial workflow and a fragment using a configurable join control operator. Fig. 10: Merging the succeeding activities of the initial workflow and a fragment Fig. 9: Merging the preceding activities of the initial workflow and a fragment using a configurable join control operator. Fig. 10: Merging the succeeding activities of the initial workflow and a fragment using a configurable split control operator.
  25. 25. Example from eScience 18 Feb 2016CAIR, Lyon31 To assist the designer is specifying a workflow for profiling genes that are suspected to be involved in a genetic disease, in this case the huntignton disease. people. Although the genetic mutation that causes HD was identified 20 years ago [19], the downstream molecular mechanisms leading to the HD phenotype are still poorly understood. Relating alterations in gene expression to epigenetic information might shed light on the disease aetiology. With this in mind, the scientist wanted to specify a workflow that annotate HD gene expression data with publicly available epigenetic data [20]. Fig. 11: Initial Huntington Disease profiling workflow. The initial workflow specified by the designer is illustrated in Figure 11. The workflow contains three activities that are ordered in sequence. The first activity MapGeneT oConcept is used to extract the concepts that are used in domain ontologies to characterize a given gene that is known or suspected to be involved in an infection or disease, in this case the Huntigton Disease. The second activity IndentifySimilarConcepts is then used to identify for each of those concept the concepts that are similar. The rational behind doing so is that genes that are associated with concepts that are similar to concepts that are involved in the disease have chances of being also involved in the disease. This activity involves running ontological similarity tests to identify similar concepts. The third activity GetConceptIds is then used to extract the identifiers of the similar concepts. keywords = {concept, score, rank, profile} the common activity in the initial workflow, GetConceptIds. Still, our method was able detect it as a common activity thanks to the string similarity utilized. Fig. 12: Candidate fragment selected by the workflow designer. By applying the composition algorithm presented in Section 5 to merge the initial workflow and the fragment selected by the user, we obtained the workflow depicted in Figure 13. Indeed, the common activity has no succeeding activity in the initial workflow and has no preceding activity in the workflow fragment (lines 1, 2 in Figure 7). Therefore, the two workflows are connected ins sequence using the common activity.
  26. 26. Example from eScience 18 Feb 2016CAIR, Lyon32 To assist the designer is specifying a workflow for profiling genes that are suspected to be involved in a genetic disease, in this case the huntignton disease. people. Although the genetic mutation that causes HD was identified 20 years ago [19], the downstream molecular mechanisms leading to the HD phenotype are still poorly understood. Relating alterations in gene expression to epigenetic information might shed light on the disease aetiology. With this in mind, the scientist wanted to specify a workflow that annotate HD gene expression data with publicly available epigenetic data [20]. Fig. 11: Initial Huntington Disease profiling workflow. The initial workflow specified by the designer is illustrated in Figure 11. The workflow contains three activities that are ordered in sequence. The first activity MapGeneT oConcept is used to extract the concepts that are used in domain ontologies to characterize a given gene that is known or suspected to be involved in an infection or disease, in this case the Huntigton Disease. The second activity IndentifySimilarConcepts is then used to identify for each of those concept the concepts that are similar. The rational behind doing so is that genes that are associated with concepts that are similar to concepts that are involved in the disease have chances of being also involved in the disease. This activity involves running ontological similarity tests to identify similar concepts. The third activity GetConceptIds is then used to extract the identifiers of the similar concepts. keywords = {concept, score, rank, profile} Fig. 13: Workflow obtained by merging the initial workflow and selected frag- ment.
  27. 27. Conclusions 18 Feb 2016CAIR, Lyon33 1.  The method we have described for mining workflow fragments have been empirically tested. 2.  The method for retrieving and composing workflow fragments has been used in an eScience use case to assist the design of a Huntington Disease workflow 3.  The work on mining frequent workflow fragments has been published in the International Keystone conference 4.  The extended version with results on the retrieval and composition of workflow fragments has been submitted to a special issue on keyword search of the Springer Journal: Transactions on Computational Collective Intelligence
  28. 28. Khalid Belhajjame Joint work with Daniela Grigori, Mariem Harmassi et Manel BenYahia LAMSADE, Université Paris Dauphine Keyword-Based Search of Workflow Fragments and their Composition

×