Anr cair meeting feb 2016

Khalid Belhajjame
Joint work with
Daniela Grigori, Mariem Harmassi et Manel BenYahia
LAMSADE, Université Paris Dauphine
Keyword-Based Search of Workflow
Fragments and their Composition

Workflows
A business process specified using the
BPMN notation
A ScientificWorkflow system
(Taverna)
A workflow consists of an orchestrated and repeatable pattern of business activity enabled
by the systematic organization of resources into processes that transform materials, provide
services, or process information (Workflow Coalition)
CAIR, Lyon2 18 Feb 2016

Workflows are difficult to design
—  The design of scientific workflows, just like business process,
can be a difficult task
—  Deep knowledge of the domain
—  Awareness of the resources, e.g., programs and web
services, that can enact the steps of the workflow
—  Publish and share workflows, and promote their reuse.
—  myExperiment, CrowldLab, Galaxy, and other various business
process repository
—  Reuse is still an aim.
—  There are no capabilities that support the user in identifying the
workflows, or fragments thereof, that are relevant for the task
at hand.

Assisting designers of workflows
through reuse
CAIR, Lyon5
—  There has been a good body of work in the literature on
workflow discover.
—  The typical approach requires the user to specify a query in the
form of a workflow that is compared against a collection of
workflows.
—  In our work, we focus on a problem that is different and that
received little attention.
—  We position ourselves in a context where a designer is
specifying his/her workflow and needs assistance to design
parts of her workflow (which we term fragments).
18 Feb 2016

Approach

Mining Frequent Workflow Fragments
18 Feb 2016CAIR, Lyon7
Given a repository of workflows, we would like to mine
frequent workflow fragments, i.e., fragments that are used
across multiple workflows.This raises two questions:
—  How to deal with the heterogeneity of the labels used by
different users to model the activities of their workflows
within the repository?
—  Which graph representation is best suited for formatting
workflows for mining frequent fragments?

Homogenizing Activity Labels
—  Activities implementing the same functionality should have
the same names.
—  Workflow authored by different people are likely to use
different labels for naming activities that perform the same
task.
—  To homogenize activity labels we use state of the art
techniques, which rely on a common dictionary for:
—  Identifying the activities that perform the same task, and
—  Selecting the same label for naming them.

Detecting Recurrent Fragments
—  We use graph mining algorithms to identify the fragments in
the repository that are recurrent.
—  We use the SUBDUE algorithm.
—  Which graph representation to use to represent (workflow)
fragments?
—  We examined a number of workflow representation

Representation A
att1
att2 att3
att4
att5
next
operator
And
operator
sequence
next
operand
operator
Xor
type
type
operand
next
operand
typeoperand
operand
Representation B
att1
att2
att3
att4
att5
next
Split-And
next
Join-Xor
J-Xor
sequence
next
sp-and
sp-and

Representation C
att2
att3
att4
att5
att1
S-att1-att2 S-att1-att3
seq-att2-att4
seq-att4-att5
att2
att3
att5
att1
S-att1-att2 S-att1-att3
seq-att3-att5

att1
att2
att3
att4
att5
And_att1_att3
And_att1_att2
XOR_att3_att5
SEQ_att2_att4
XOR_att4_att5
Representation D Representation D1
att1
att2
att3
att4
att5
And
And
XOR
SEQ
XOR

Experiment
l Objective:To assess the suitability of the graph
representations for mining workflow graphs
l Effectiveness : Precision/ Recall
l Size of the graphs
l Execution time

Dataset
—  We created three datasets of workflow specifications,
containing respectively 30, 42, and 71 workflows.
—  9 out of these workflows are similar to each other and, as
such contain recurrent structures, that should be detected by
the mining algorithm.
—  Despite the small size of the collection, these datasets
allowed to distinguish to a certain extent between the
different representations.

Experiment results:
Input Data size

Experiment:
Effectiveness (Precision/ Recall)
CAIR, Lyon19
Precision =
#Truepositives
#Truepositives + #Falsepositives
Recall =
#Truepositives
#Truepositives + #Falsenegatives
18 Feb 2016

Representation A
att1
att2 att3
att4
att5
next
operator
And
operator
sequence
next
operand
operator
Xor
type
type
operand
next
operand
typeoperand
operand
Representation B
att1
att2
att3
att4
att5
next
Split-And
next
Join-Xor
J-Xor
sequence
next
sp-and
sp-and

Experiment1:
Effectiveness (Precision/ Recall)
CAIR, Lyon21
Precision =
#Truepositives
#Truepositives + #Falsepositives
Recall =
#Truepositives
#Truepositives + #Falsenegatives
18 Feb 2016

Approach

Keyword-Based Search for frequent
workflow fragments
Given an initial workflow, the user issues a query to
characterize the desired missing workflow fragment.The query
is composed of two elements:
1.  A set of keywords {kw1,…,kwn} characterizing the
functionality that the fragment should implement
2.  The set of activities in the initial workflow that are to be
connected to the fragment in question.
Acommon = {a1,…,am},We call these activities common or
joint activities, interchangeably.

Retrieving Relevant Fragments
—  The first step in processing the user query consists in
identifying the fragments in the repository that are relevant
given the specified keywords.
—  In doing so, we adopt the widely used technique ofTF/IDF
(term frequency / inverse document frequency) measure.
—  Note however that workflow fragments are indexed by the
labels of the activities obtained by homogenization, which we
refer to by IDX
—  Whereas, the user uses keywords of his own choosing

Retrieving relevant Fragments
—  For each keyword kwi provided by the user, we retrieve its
synonyms, syn(kwi), from an existing thesaurus, e.g.,
Wordnet
—  The index IDX is trawled to identify if there is a term in
syn(kwi) U kwi, which is can be used to represent the
keyword in the vector used for retrieving fragments.
—  Once the fragments are retrieved, their constitent activities
are matched against the common activities in the initial
workflow that the user identified

Ranking Fragments
The score used for ranking candidate fragments given a user
query uq is defined as follows:
—  Relevance: denotes the score obtained usingTF-IDF
—  Frequency:A workflow fragment that is frequently used
across multiple workflow is likely to implement best
practices that are useful for the designer compared with a
fragment that apperas in, say one workflow
Compatibility(wffrg, uq.Acommon) =
aj 2Acommon
matchingScore(aj, wffrg)
|Acommon|
where matchingScore(aj, wffrg) is the matching score between aj and the
best matching activity in wffrg, and |Acommon| the number of activities
in Acommon. Compatibility(wffrg, Acommon) takes a value between 0 and 1.
The larger is the number of activities Acommon that have a matching activity
in wffrg and the higher are the matching scores, the higher is the compati-
bility between the candidate workflow fragment and the initial workflow.
Based on the above factors, we define the score used for ranking candidate
fragments given a user query uq as:
Score(wffrg, uq) =
wr.Relvance(wffrg, uq) + wf .
F requency(wffrg)
MaxF requency
+ wc.Compatibility(wffrg, uq.Acommon)
where wr, wf and wc are positive real numbers representing weights such that
wr + wf + wc = 1. MaxFrequency is positive number denoting the frequency
of the fragment that appears the maximum of times in the workflow repository
harvested. Notice that the score takes a value between 0 and 1. Once the can-
didate fragments are scored, they are ranked in the descendant order of their
associated scores. The top-k fragments, e.g., 5, are then presented to the designer
who selects to the one to be composed with the initial workflow.
5 Composing Workflow Fragments
Once the user has examined the fragments that are retrieved given the set
of keywords s/he specified, s/he can choose a fragment to be composed with

Ranking Fragment: Compatibility
To estimate the compatibility, we consider the number of
activities in Acommon that have a matching activity in the
workflow fragment and their associated matching score.
ated matching score. Specifically, we define the compatibility of a workflow
fragment given the activities uq.Acommon specified in the user query uq as
follows:
Compatibility(wffrg, uq.Acommon) =
P
aj 2Acommon
matchingScore(aj, wffrg)
|Acommon|
where matchingScore(aj, wffrg) is the matching score between aj and the
best matching activity in wffrg, and |Acommon| the number of activities
in Acommon. Compatibility(wffrg, Acommon) takes a value between 0 and 1.
The larger is the number of activities Acommon that have a matching activity
in wffrg and the higher are the matching scores, the higher is the compati-
bility between the candidate workflow fragment and the initial workflow.
Based on the above factors, we define the score used for ranking candidate
fragments given a user query uq as:
Score(wffrg, uq) =
wr.Relvance(wffrg, uq) + wf .
F requency(wffrg)
MaxF requency
+ wc.Compatibility(wffrg, uq.Acommon)

Composing Workflow Fragments
configurable split operator to connect the two workflows as illustrated in Figure
10 (lines 9 and 10). We call such an operator configurable because it is up
to the user to choose if such an operator is of a type and split, or split
or xor split. The preceding activity of such an operator is acommon and its
succeeding activities are the succeeding activities of wfinitial and wffrg.
Fig. 8: Merging the initial workflow and a fragment in sequence based on the
common activity acommon.
Fig. 9: Merging the preceding activities of the initial workflow and a fragment
using a configurable join control operator.
Fig. 10: Merging the succeeding activities of the initial workflow and a fragment
Fig. 9: Merging the preceding activities of the initial workflow and a fragment
using a configurable join control operator.
Fig. 10: Merging the succeeding activities of the initial workflow and a fragment
using a configurable split control operator.

Example from eScience
To assist the designer is specifying a workflow for profiling genes that are suspected
to be involved in a genetic disease, in this case the huntignton disease.
people. Although the genetic mutation that causes HD was identified 20 years
ago [19], the downstream molecular mechanisms leading to the HD phenotype
are still poorly understood. Relating alterations in gene expression to epigenetic
information might shed light on the disease aetiology. With this in mind, the
scientist wanted to specify a workflow that annotate HD gene expression data
with publicly available epigenetic data [20].
Fig. 11: Initial Huntington Disease profiling workflow.
The initial workflow specified by the designer is illustrated in Figure 11.
The workflow contains three activities that are ordered in sequence. The first
activity MapGeneT oConcept is used to extract the concepts that are used in
domain ontologies to characterize a given gene that is known or suspected to
be involved in an infection or disease, in this case the Huntigton Disease. The
second activity IndentifySimilarConcepts is then used to identify for each of
those concept the concepts that are similar. The rational behind doing so is that
genes that are associated with concepts that are similar to concepts that are
involved in the disease have chances of being also involved in the disease. This
activity involves running ontological similarity tests to identify similar concepts.
The third activity GetConceptIds is then used to extract the identifiers of the
similar concepts.
keywords = {concept,
score,
rank,
profile}
the common activity in the initial workflow, GetConceptIds. Still, our method
was able detect it as a common activity thanks to the string similarity utilized.
Fig. 12: Candidate fragment selected by the workflow designer.
By applying the composition algorithm presented in Section 5 to merge the
initial workflow and the fragment selected by the user, we obtained the workflow
depicted in Figure 13. Indeed, the common activity has no succeeding activity
in the initial workflow and has no preceding activity in the workflow fragment
(lines 1, 2 in Figure 7). Therefore, the two workflows are connected ins sequence
using the common activity.

Example from eScience
To assist the designer is specifying a workflow for profiling genes that are suspected
to be involved in a genetic disease, in this case the huntignton disease.
people. Although the genetic mutation that causes HD was identified 20 years
ago [19], the downstream molecular mechanisms leading to the HD phenotype
are still poorly understood. Relating alterations in gene expression to epigenetic
information might shed light on the disease aetiology. With this in mind, the
scientist wanted to specify a workflow that annotate HD gene expression data
with publicly available epigenetic data [20].
Fig. 11: Initial Huntington Disease profiling workflow.
The initial workflow specified by the designer is illustrated in Figure 11.
The workflow contains three activities that are ordered in sequence. The first
activity MapGeneT oConcept is used to extract the concepts that are used in
domain ontologies to characterize a given gene that is known or suspected to
be involved in an infection or disease, in this case the Huntigton Disease. The
second activity IndentifySimilarConcepts is then used to identify for each of
those concept the concepts that are similar. The rational behind doing so is that
genes that are associated with concepts that are similar to concepts that are
involved in the disease have chances of being also involved in the disease. This
activity involves running ontological similarity tests to identify similar concepts.
The third activity GetConceptIds is then used to extract the identifiers of the
similar concepts.
keywords = {concept,
score,
rank,
profile}
Fig. 13: Workflow obtained by merging the initial workflow and selected frag-
ment.

Conclusions
1.  The method we have described for mining workflow
fragments have been empirically tested.
2.  The method for retrieving and composing workflow
fragments has been used in an eScience use case to assist
the design of a Huntington Disease workflow
3.  The work on mining frequent workflow fragments has
been published in the International Keystone conference
4.  The extended version with results on the retrieval and
composition of workflow fragments has been submitted to
a special issue on keyword search of the Springer Journal:
Transactions on Computational Collective Intelligence

Anr cair meeting feb 2016

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (6)

Similar to Anr cair meeting feb 2016

Similar to Anr cair meeting feb 2016 (20)

More from Khalid Belhajjame

More from Khalid Belhajjame (19)

Recently uploaded

Recently uploaded (20)

Anr cair meeting feb 2016

Editor's Notes