citation:
Missier, P., Soiland-Reyes, S., Owen, S., Tan, W., Nenadic, A., Dunlop, I., et al. (2010).
Taverna, reloaded. In M. Gertz, T. Hey, & B. Ludaescher (Eds.), Procs. SSDBM 2010. Heidelberg, Germany.
Paper available at: http://www.ssdbm2010.org/.
HTML Injection Attacks: Impact and Mitigation Strategies
Paper presentation: Taverna, reloaded
1. , reloaded
Paolo Missier, Stian Soiland-Reyes, Stuart Owen, Alex Nenadic,
Ian Dunlop, Alan Williams, Carole Goble
School of Computer Science
University of Manchester, UK
Wei Tan
Tom Oinn
Argonne National Laboratory
Argonne, IL, USA EBI, UK
SSDBM
Heidelberg, Germany
June 30 - July 2, 2010
2. Taverna: scientific workflow management
Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on
samples from different lymphoma types
source: caGrid
http://www/myexperiment.org/workflows/746 2
3. Taverna: scientific workflow management
Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on
samples from different lymphoma types
lymphoma samples
source: caGrid ➔ hybridization data
http://www/myexperiment.org/workflows/746 2
4. Taverna: scientific workflow management
Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on
samples from different lymphoma types
lymphoma samples
source: caGrid ➔ hybridization data
process microarray
data as training dataset
http://www/myexperiment.org/workflows/746 2
5. Taverna: scientific workflow management
Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on
samples from different lymphoma types
lymphoma samples
source: caGrid ➔ hybridization data
process microarray
data as training dataset
learn
predictive model
http://www/myexperiment.org/workflows/746 2
6. Taverna: scientific workflow management
Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on
samples from different lymphoma types
lymphoma samples
source: caGrid ➔ hybridization data
http://www/myexperiment.org/workflows/746 2
7. In this (short) talk...
• Taverna for data-intensive science
– architectural goals... and how they are being achieved
• Performance figures on benchmark workflows
3
8. Target -ilities
• Scalability
– size of input data / size of input collections
• Configurability
– each processor in the workflow individually tunable to
adjust to the underlying platform
• Extensibility
– plugin architecture to accommodate specific service
groups
• caGrid, BioMoby
– extensible set of operators that define the semantics of the
model
4
9. Opportunities for parallel processing
• intra-processor: implicit iteration over list data
• inter-processor: pipelining
[ id1, id2, id3, ...]
5
10. Opportunities for parallel processing
• intra-processor: implicit iteration over list data
• inter-processor: pipelining
[ id1, id2, id3, ...]
SFH1 SFH2 SFH3
... ... ...
getDS1 getDS2 getDS3
5
11. Opportunities for parallel processing
• intra-processor: implicit iteration over list data
• inter-processor: pipelining
[ id1, id2, id3, ...]
SFH1 SFH2 SFH3
... ... ...
getDS1 getDS2 getDS3
[ DS1, DS2, DS3, ...]
5
12. Configurable processor behaviour
allocates concurrent
threads to list elements
Default dispatch stack
Request
Parallelise
Provenance
captures service
Error bounce invocation events
Failover
Response
Retry persistent server error:
tries each available
Stop/pause
service implementation
Invoke in turn
transient errors:
interacts with retries one service
underlying activity invocation
- service operation
- shell script execution
14. Extensible processor semantics
Default dispatch stack Default dispatch stack Extended dispatch stack
Request
Request
Request
Parallelise
Enhancing dataflow model Parallelise
with limited form of loops to
Parallelise
Provenance Error bounce
model busy waiting
Error bounce
Error bounce (asynchronous service Loop/Branching
interaction) Failover
Failover Failover
Response
Retry
Response
Response
Retry Retry
Invoke
Stop/pause Invoke
Invoke
(a) (b)
7
15. Extensible processor semantics
Default dispatch stack Default dispatch stack Extended dispatch stack
Request
Request
Request
Parallelise
Enhancing dataflow model Parallelise
with limited form of loops to
Parallelise
Provenance Error bounce
model busy waiting
Error bounce
Error bounce (asynchronous service Loop/Branching
interaction) Failover
Failover Failover
Response
Retry
Response
Response
Retry Retry
Invoke
Stop/pause Invoke
."4&04& 7';'3%
Invoke ;(4)4&0;=;"2.'4
'&&'./304&5,%& +"#,- (a) (b)
!"#$#%&'()(*+,-%$.*//$0*1
2*(34/*56#%.8&9
+"#,- &670
Busy waiting !"#$)* 80&$9:5$;0%(<&
using an
;0%(<& '&&'./304&5,%&
explicit +"#,-
workflow ./0.12&'&(%
%&'&(% '&&'./304&5,%&
pattern 2*(34/*5678&.8&9
+"#$%&'&(%
!"#$%&'&(% )%$-"40
,%$-"40
&0%&
2(..0%%
7
16. Extensible processor semantics
Default dispatch stack Default dispatch stack Extended dispatch stack
Request
Request
Request
Parallelise
Enhancing dataflow model Parallelise
with limited form of loops to
Parallelise
Provenance Error bounce
model busy waiting
Error bounce
Error bounce (asynchronous service Loop/Branching
interaction) Failover
Failover Failover
Response
Retry
Response
Response
Retry Retry
Invoke
Stop/pause Invoke
."4&04& 7';'3%
Invoke ;(4)4&0;=;"2.'4
'&&'./304&5,%& +"#,- (a) (b)
!"#$#%&'()(*+,-%$.*//$0*1
2*(34/*56#%.8&9
+"#,- &670 (3/360 4"7&)7&
Busy waiting !"#$)* 80&$9:5$;0%(<& /1787&)/9/":437
using an
;0%(<& '&&'./304&5,%& 3&&3456)7&.$0& !"#$%
explicit +"#,- !"#$%
workflow ./0.12&'&(%
%&'&(% '&&'./304&5,%&
Busy waiting 45)4;:&3&10
3&&3456)7&.$0& 0&3&10
pattern using an
2*(34/*5678&.8&9
+"#$%&'&(% loop !"#$% &'()
!"#$%&'&(% )%$-"40
,%$-"40
processor *)&+,-.+/)012&
/)012& 3&&3456)7&.$0&
&0%&
2(..0%%
7
17. Performance experimental setup
• previous version of Taverna engine used as baseline
• objective: to measure incremental improvement
list generator
Parameters:
- byte size of list elements (strings)
- size of input list
- length of linear chain
multiple parallel pipelines
main insight: when the workflow is designed for
pipelining, parallelism is exploited effectively
8
18. I - Memory usage
shorter execution
time due to pipelining
T2 main memory
data management
T2 main embedded Derby back-end
T1 baseline
list size: 1,000 strings of 10K chars each
no intra-processor parallelism (1 thread/processor)
9
19. I - Memory usage
shorter execution
time due to pipelining
T2 main memory
data management
T2 main embedded Derby back-end
T1 baseline
list size: 1,000 strings of 10K chars each
no intra-processor parallelism (1 thread/processor)
9
21. Bounded main memory usage
Separation of data and process spaces ensures scalable data management
varying data element size:
10K, 25K, 100K chars
11
22. Summary
• Taverna 2 re-engineered for scalability on data-
intensive pipelined dataflows
• separation of data and process spaces
• generic stack pattern for principled extensions
– limited while loop, if-then-else
– provenance capture
– ... open plugin architecture accommodates further
extensions
For further details and a nice chat:
please visit our poster!!
12