"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
Exploiting Multicores to Optimize Business Process Execution
1. Exploiting Multicores to
Optimize Business Process
Execution
Daniele Bonetta,
Achille Peternier, Cesare Pautasso, Walter Binder
Faculty of Informatics
University of Lugano - USI
Switzerland
http://sosoa.inf.usi.ch
daniele.bonetta@usi.ch
Tuesday, December 14, 2010
2. BP Execution Engine
Focus: Business Process Runtime
Execution Environment
Web
Service
Composite
BP Web
Client Web
Execution Service
Service
Engine Web
Service
Tuesday, December 14, 2010
3. How to scale?
Client
ClientClient
Web
Client
Client Service
Client Web
Composite
BP
Client
Client Service
Web
Execution
Client Service
Engine Web
Client
Service
Client
Client
Client
Client
Tuesday, December 14, 2010
4. Client How to scale?
Client
Client
Client Client Client
Client Client
Client
Client Client Web
ent Client
Client Service
ent Client
Client Web
Client Composite
BP
Client
Client Service
ent Web
Execution
Client Client Service Web
Client Engine
ient Client
Client Service
Client Client
Client
Client
Client
Client
Client Client
Client ClientClient
Client Client
Client Client
Tuesday, December 14, 2010
5. Client How to scale?
Client
Client
Client Client Client
Client Client
Client
Client Client Web
ent Client
Client Service
ent Client
Client Web
Client Composite
Service
More clients == More BP Instances
Client
Client Web
Composition Service
ent
Client Client
Client Service
Engine Web
ient Client
Client Service
Client Client
Client
Client
Client
Client
Client Client
Client ClientClient
Client Client
Client Client
Tuesday, December 14, 2010
6. Client How to scale?
Client
Client
Client Client Client
Client Client
Client
Client Client Web
ent Client
Client Service
ent Client
Client Web
Client Composite
Service
More clients <= More BP Instances
Client
Client Web
Composition Service
ent
Client Client
Client Service
Engine Web
ient Client
Client Service
Client Client
Client
Client
Client
Client
Client Client
Client ClientClient
Client Client
Client Client
Tuesday, December 14, 2010
8. Outline
1. Multicore Issues
2. JOpera Business Process Execution Engine
1. Thread Level Parallelism
2. CPU/Core Level Parallelism
3. Experimental Results
4. Conclusion
Tuesday, December 14, 2010
9. Multicore Issues
• Number of cores
• Type of cores (e.g.
SMT)
• On Chip Caching
Layout (e.g. L2, L3...)
• On Board Memory
Layout (e.g. NUMA,
NUMA-CC, ...)
Tuesday, December 14, 2010
10. Multicore Issues
• Cores Num Th Migrations,
• Cores Type Ctx Switches
• Cache Layout
• Memory Layout
Tuesday, December 14, 2010
11. Multicore Issues
• Cores Num Th Migrations,
• Cores Type Ctx Switches
• Cache Layout Data Locality,
• Memory Layout Contention
Tuesday, December 14, 2010
12. BP Execution Engine
Java Business Process Execution Engine
Tuesday, December 14, 2010
13. 3 Layers Approach
Concurrent Business
Process Instances
OS Threads
Hardware Cores
Tuesday, December 14, 2010
14. Abstraction Layers
Concurrent Business
Process Instances
OS Threads
Hardware Cores
Tuesday, December 14, 2010
31. Experimental Setup
Concurrent Business
Process Instances Up to 30’000
OS Threads
k
Hardware Cores
12
Tuesday, December 14, 2010
32. Thread-level Parallelism
How many threads?
Just increase the number of
parallel concurrent threads
in the pools for an increasing
number of instances?
Tuesday, December 14, 2010
33. Thread-level Parallelism
Just increasing the number of threads...
1800 ForEach
Sequential
1600 Parallel
Loop
Throughput (req/s)
1400
Throughput (Instances/sec)
1200
1000
800
600
400
200
0 20 40 60 80 100 120 140
Number of threads (per pool)
# of threads
Tuesday, December 14, 2010
34. Experimental Setup
Concurrent Business
Process Instances Up to 30’000
OS Threads
24
Hardware Cores
12
Tuesday, December 14, 2010
40. Experimental Setup
Concurrent Business
Process Instances Up to 30’000
OS Threads
24
Hardware Cores
12
Tuesday, December 14, 2010
41. Performance Layers
Concurrent Business
Process Instances 5’000 - 30’000
Throughput, Walltime, ...
OS Threads
24
Hardware Performance Counters:
Hardware Thread Migrations, Context sw, ...
Cache miss, Cores
12
Tuesday, December 14, 2010
42. Experimental Results
Relative Speedup with 30k instances
30000 Instances
1.3 Default
Per CPU
Per core
1.2 Interleaved
Relative Speedup
1.1
1
0.9
0.8
0.7
DAG Parallel Sequential Loop Geomean
2 x AMD Barcelona 6 cores processors with 2 LLC
Tuesday, December 14, 2010
43. HPC-Based Validation
Ineffective sw prefetches
A prefetch request for a memory
address already in the cache
L3 cache evictions
Data that needs to be stored in the
cache is bigger than free available space
Tuesday, December 14, 2010
44. Experimental Results
Ineffective SW prefetches
30000 Instances
1.2 Default
Per CPU
Relative Ineffective SW Prefetches
Per core
1.1 Interleaved
1
0.9
0.8
0.7
DAG Parallel Sequential Loop Geomean
2 x AMD Barcelona 6 cores processors with 2 LLC
Tuesday, December 14, 2010
45. Experimental Results
L3 Cache evictions
30000 Instances
Default
2 Per CPU
Per core
1.8 Interleaved
Relative L3 Evictions
1.6
1.4
1.2
1
0.8
0.6
0.4
DAG Parallel Sequential Loop Geomean
2 x AMD Barcelona 6 cores processors with 2 LLC
Tuesday, December 14, 2010
46. Experimental Results
Relative Speedup with 5k instances
10000 Instances
1.2 Default
Per CPU
Per core
1.1 Interleaved
Relative Speedup
1
0.9
0.8
0.7
DAG Parallel Sequential Loop Geomean
2 x AMD Barcelona 6 cores processors with 2 LLC
Tuesday, December 14, 2010
47. Experimental Results
Ineffective SW prefetches
10000 Instances
1.2 Default
Per CPU
Relative Ineffective SW Prefetches
Per core
1.1 Interleaved
1
0.9
0.8
0.7
DAG Parallel Sequential Loop Geomean
2 x AMD Barcelona 6 cores processors with 2 LLC
Tuesday, December 14, 2010
48. Experimental Results
L3 Cache evictions
10000 Instances
1.8 Default
Per CPU
1.6 Per core
Interleaved
Relative L3 Evictions
1.4
1.2
1
0.8
0.6
0.4
DAG Parallel Sequential Loop Geomean
2 x AMD Barcelona 6 cores processors with 2 LLC
Tuesday, December 14, 2010
49. Experimental Results
Relative Speedup with 10k instances
5000 Instances
1.3 Default
Per CPU
Per core
1.2 Interleaved
Relative Speedup
1.1
1
0.9
0.8
0.7
DAG Parallel Sequential Loop Geomean
2 x AMD Barcelona 6 cores processors with 2 LLC
Tuesday, December 14, 2010
50. Experimental Results
Ineffective SW prefetches
5000 Instances
1.2 Default
Per CPU
Relative Ineffective SW Prefetches
Per core
1.1 Interleaved
1
0.9
0.8
0.7
DAG Parallel Sequential Loop Geomean
2 x AMD Barcelona 6 cores processors with 2 LLC
Tuesday, December 14, 2010
51. Experimental Results
L3 Cache evictions
5000 Instances
Default
2 Per CPU
Per core
1.8 Interleaved
Relative L3 Evictions
1.6
1.4
1.2
1
0.8
0.6
0.4
DAG Parallel Sequential Loop Geomean
2 x AMD Barcelona 6 cores processors with 2 LLC
Tuesday, December 14, 2010
53. Conclusion
• Multicore machines offer powerful hardware
parallelism, but what matters is not just the
number of PEs
• The performance depends on how a limited
amount of threads are mapped to the HW
• Multicore Aware Thread Scheduling
significantly impacts the performance (up to
10% speedup)
Tuesday, December 14, 2010
54. Thank you!
OverHPC Library:
http://sosoa.inf.usi.ch
JOpera business process execution engine:
http://www.jopera.org
Twitter:
@jopera_org
me:
daniele.bonetta@usi.ch
Tuesday, December 14, 2010