"Can programming of multi-core systems be easier, please? The ALMA Approach"
By Oliver Oey, Karlsruhe Institute of Technology (KIT), for ScilabTEC 2015
1. FP7-ICT-2011-7-287733 – ScilabTEC – Oliver Oey – oliver.oey@kit.edu 1
FP7-ICT-2011-7-287733
ALMA Project Overview
Simplifying programming for multi-cores
Oliver Oey
Outline
ALMA EU Project Overview
Project Overview
Motivation
Results
MatrixFrontend
Type inference
Loopify
Simplify
emmtrix Technologies
Summary
ALMA Project ID Card
Three year project: 01/09/2011 – 31/01/2015
Funded by FP7: 3.2 Million Euros
Official web site: http://www.alma-project.eu/
Coordinator: Juergen Becker (KIT) and Timo Stripf (KIT)
Scientific Coordinator: Nikos Voros (TWG)
Why do we need multi-core processors?
Until ~2005, processor performance increases were driven by
Clock speed
Execution optimization
Cache
Power wall and ILP wall led to multicore processors
Parallelism must be exposed by the programmer
(source http://www.gotw.ca/publications/concurrency-ddj.htm)
Motivation
End user perspective
• Explore/develop algorithms
• Use a simple, comfortable language
  • E.g. Matlab, Scilab, …
• Don’t want to care about
  • data types
  • parallelism
• End result
  • Performance
  • Energy efficient
  • Cost efficient
• Fast development time
Target architecture perspective
• Multi-Processor System-on-Chip
• Parallel processor cores
  • Explicit parallel programming
  • Distributed memory model, e.g. MPI
• Parallelism within the processor cores
  • Single Instruction Multiple Data
  • Very Long Instruction Word
• Native data types
  • E.g. 32-bit integer
  • Floating-point operations perform inefficiently
Hide the complexity from the end user
ALMA Development Flow (overview)
[Figure: ALMA development flow. Inputs: embedded application design (translation to Scilab & annotations) and multi-core hardware design (abstract hardware description, ADL; structural hardware description). The ALMA algorithm parallelization tools produce C-based code with parallel descriptions. The KIT C-compiler and the Recore C-compiler produce the executable binary (for simulator and HW), which runs on the multi-core simulator. Feedback paths: parameters for algorithm optimization, feedback for optimization. Output: optimized application code on multi-core platform.]
Challenges for Compiling Scilab to MPSoCs
Scilab programming language
Sequential, imperative language
Dynamic typing (scalars, vectors, matrices)
End users typically use floating-point data types
Pointer-free, i.e. no memory aliasing problems
Natural parallelism within vector operations
MPSoC target architectures
Exploit coarse-grain parallelism (task-level)
Distributed memory
Exploit fine-grain parallelism (instruction-level)
ALMA Target Architectures
Xentium® processing tile
Fixed-point DSP processing
10-issue VLIW processor
SIMD capability
Streaming communication services
Multicore Architectures
Distributed memory
=> No shared memory required
No floating point unit
=> Use fixed-point arithmetic
Example Architecture: Recore X2014
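Because the target has no floating-point unit, floating-point values from Scilab have to be mapped to fixed-point arithmetic. The following is a minimal sketch of that idea in C, assuming a signed Q15 format (16-bit values with 15 fractional bits). The format and the names q15_t, q15_from_double, q15_mul and q15_to_double are illustrative assumptions and are not taken from the ALMA tools.

```c
#include <assert.h>
#include <stdint.h>

/* Q15 (assumed format): value = raw / 32768, range [-1, 1). */
typedef int16_t q15_t;

/* Convert a double in [-1, 1) to Q15, rounding to nearest. */
q15_t q15_from_double(double x)
{
    long r;
    assert(x >= -1.0 && x < 1.0);
    r = (long)(x * 32768.0 + (x >= 0.0 ? 0.5 : -0.5));
    if (r > 32767) r = 32767;          /* clamp values just below 1.0 */
    return (q15_t)r;
}

/* Convert a Q15 value back to double (exact). */
double q15_to_double(q15_t q)
{
    return (double)q / 32768.0;
}

/* Multiply two Q15 values using a 32-bit intermediate product.
   Assumes an arithmetic right shift for negative values, which is
   implementation-defined in C but common on DSP and desktop compilers. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */
    p += 1 << 14;                          /* round to nearest */
    p >>= 15;                              /* scale back to Q15 */
    if (p > 32767) p = 32767;              /* saturate the -1 * -1 case */
    return (q15_t)p;
}
```

With these helpers, 0.5 * 0.5 is computed as q15_mul(q15_from_double(0.5), q15_from_double(0.5)) and converts back to 0.25. Only integer multiply, add and shift operations are needed, which is what makes the approach suitable for a core without an FPU.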
Application Test Cases - Telecommunications
[Figure: block diagram of the IEEE 802.16e PHY layer in NT x NR MIMO configuration, with blocks marked as ALMA 1st increment or ALMA 2nd increment. BS Rx blocks (Rx 1 to Rx NR): cyclic prefix removal, FFT, channel estimator, equalizer, diversity combiner, symbol deconstruction, uplink frame deconstruction, constellation demapping, deinterleaver, FEC decoder, derandomizer, SDU generation, MAC-PHY I/F. BS Tx blocks (Tx 1 to Tx NT): PDU generation, UL/DL scheduler, UL/DL frame mapper, MAC/PHY control, MAC-PHY I/F, randomizer, FEC encoder, interleaver, constellation mapping, symbol construction, downlink frame construction, S-T coding, IFFT, cyclic prefix and preamble insertion.]
Speedup: ~2.8
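Development effort
[Figure: bar chart of development effort in working days (axis 0 to 80) for the telecommunication and image processing test cases, comparing manual parallelization with automatic parallelization; labelled reductions: -57% and -30%.]
Development effort is reduced, in part by more than 50%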
ALMA Workflow
[Figure: ALMA workflow. Elements: development using Scilab, ALMA parallelization tools, parallel C code, testing on a PC, and a testing platform with a multi-core processor, organised in development cycles I and II.]
ALMA Workflow (Details)
[Figure: Development with Scilab → Matrix Frontend → sequential static C code → Parallelization → parallel C code, organised in development cycles I and II.]
Matrix Frontend
[Figure: Development with Scilab → Matrix Frontend → sequential static C code → Parallelization → parallel C code.]
Scilab-to-C Compiler
Parses Scilab code
Advanced type inference
High-level optimizations on Scilab code
Turns Scilab statements into loop nests
Generated C Code
Optimized for parallelism extraction
Static memory allocation
Avoid pointers
Requirements
Source language
Support Scilab input language
Support well-defined subset
Extend with annotations for type inference and for parallelization
Annotated code should still be valid Scilab/Matlab code
Target language
Generate ANSI C89 code
Polyhedral code
Large Static Control Parts
Avoid pointers
Static code
No dynamic memory allocation
Avoid run-time decisions
Type Inference
Calculate types for expressions and variables
“Type” = “Data Type” + “Shape”
Separated into 3 passes
1. Shape Inference
2. Data Type Inference
3. Variable Inference
Compiler stages: Scilab → Type Inference → Loopify → Simplify → C Code Output → C Code
Type Inference - Shape
Calculate shape of each Scilab statement
s = [1 2 3]; // s = 1x3
for f = 1:10 // f = 1x1
s = s + f // s = 1x3
end
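How a shape could be propagated through an expression such as s + f can be sketched in C as follows. This is only an illustration under stated assumptions: it handles two-dimensional shapes and the element-wise rule "a scalar combines with anything, otherwise shapes must match". The names shape_t, shape_is_scalar and shape_add are hypothetical and do not come from the Matrix Frontend.

```c
#include <assert.h>

/* Shape of a two-dimensional Scilab value (assumed representation). */
typedef struct {
    int rows;
    int cols;
} shape_t;

int shape_is_scalar(shape_t s)
{
    return s.rows == 1 && s.cols == 1;
}

/* Shape of the element-wise expression "a + b".
   Returns 1 and writes the result shape to *out on success,
   returns 0 if the operand shapes are incompatible. */
int shape_add(shape_t a, shape_t b, shape_t *out)
{
    assert(out != 0);
    if (shape_is_scalar(a)) { *out = b; return 1; }   /* scalar + matrix */
    if (shape_is_scalar(b)) { *out = a; return 1; }   /* matrix + scalar */
    if (a.rows == b.rows && a.cols == b.cols) { *out = a; return 1; }
    return 0;
}
```

For the loop above, s has shape 1x3 and f has shape 1x1, so shape_add yields 1x3 for s + f, which matches the shape already assigned to s.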
Type Inference – Growing Arrays
Support growing arrays
a = 1;
a(1,5) = 1; // a = [1 0 0 0 1]
Maximum size must be known!
What happens if matrix is indexed by variable?
a(1,b) = 1; // Maximum value of b unknown
Two solutions:
1. Fixed size with mfe_fixedsize
a = zeros(1,5);
mfe_fixedsize(a);
a = 1;
a(1,b) = 1;
2. Size range with mfe_size
a = 1;
mfe_size(a, 1, 1:5);
a(1,b) = 1;
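Once the maximum size is known, a growing array can be held in statically allocated storage together with its current size. The sketch below shows one way this could look in C for a row vector with at most 5 columns. The names A_MAX_COLS, row_vec_t, vec_assign_scalar and vec_set are hypothetical, and the code is not the output of the Matrix Frontend.

```c
#include <assert.h>
#include <stdint.h>

#define A_MAX_COLS 5                 /* maximum size, known at compile time */

typedef struct {
    double  data[A_MAX_COLS];        /* statically allocated storage */
    int32_t cols;                    /* current (dynamic) number of columns */
} row_vec_t;

/* Scilab "a = x": a becomes a 1x1 value. */
void vec_assign_scalar(row_vec_t *a, double x)
{
    assert(a != 0);
    a->data[0] = x;
    a->cols = 1;
}

/* Scilab "a(1,b) = x" with 1-based index b.
   Grows a if b exceeds the current size and zero-fills the new elements.
   Returns 0 if b lies outside 1..A_MAX_COLS, otherwise 1. */
int vec_set(row_vec_t *a, int32_t b, double x)
{
    int32_t i;
    assert(a != 0);
    if (b < 1 || b > A_MAX_COLS) return 0;
    for (i = a->cols; i < b; ++i) {
        a->data[i] = 0.0;            /* zero-fill the grown part */
    }
    a->data[b - 1] = x;
    if (b > a->cols) a->cols = b;
    return 1;
}
```

Calling vec_assign_scalar(&a, 1.0) followed by vec_set(&a, 5, 1.0) reproduces the [1 0 0 0 1] result from the slide without any dynamic memory allocation.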
Type Inference – Data Type
Scilab has data type conversion functions
double
int32, int16, int8
uint32, uint16, uint8
boolean
complex, real, imag
a = uint8([255 256]); // a = [255 0]
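The wrap-around in this example (256 becomes 0) corresponds to the modulo behaviour of unsigned integer conversion in C. A minimal sketch, assuming integer-valued inputs; the helper name to_uint8 is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Convert an integer-valued double to uint8_t with wrap-around (modulo 256).
   The value is first converted to a wide signed integer, because casting an
   out-of-range double directly to an 8-bit unsigned type is undefined in C. */
uint8_t to_uint8(double x)
{
    int64_t wide;
    assert(x > -9.0e18 && x < 9.0e18);   /* must fit into int64_t */
    wide = (int64_t)x;                   /* truncates toward zero */
    return (uint8_t)wide;                /* well-defined: reduced modulo 256 */
}
```

Here to_uint8(255.0) gives 255 and to_uint8(256.0) gives 0, the same values as in the Scilab example.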
Type Inference – Data Type (2)
Problem: Data type depends on run-time values
sqrt(1) => double
sqrt(-1) => complex double
sqrt(a) => ?
Scilab-conformant execution cannot be guaranteed
Solution
Generate warning
Ask end user to specify data type
real(sqrt(a)) => double
sqrt(complex(a)) => complex double
Type Inference – Variable
Shape and data type inference operate on expressions
Assign shape/data type to variables
Data type
Limitation: Data type cannot change at run time
a = 1;
a = uint8(1); // not supported: data type of a would change
Complex flag is “or”-connected: a variable is complex if any assignment to it is complex
a = 1;
a = %i;
=> complex_double_t a;
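The two data type rules for variables can be sketched as a merge function that is applied for every assignment to a variable. This is an assumption about how such a pass might be structured; base_type_t, data_type_t and merge_data_type are hypothetical names.

```c
#include <assert.h>

/* Element types used in this sketch (assumed subset). */
typedef enum { BASE_DOUBLE, BASE_INT32, BASE_UINT8 } base_type_t;

typedef struct {
    base_type_t base;    /* element type, must not change */
    int is_complex;      /* complex flag, 0 or 1 */
} data_type_t;

/* Merge the type of one assignment into the type of the variable.
   Returns 1 on success, 0 if the element type would change. */
int merge_data_type(data_type_t *var, data_type_t assigned)
{
    assert(var != 0);
    if (var->base != assigned.base) return 0;   /* limitation: no type change */
    var->is_complex = var->is_complex || assigned.is_complex;   /* "or" */
    return 1;
}
```

Merging a real double and then a complex double leaves the variable as complex double, while merging a uint8 into a double variable is rejected.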
Type Inference – Variable (2)
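Shape
Variable shape is the maximum over all assignments, per dimension
a = zeros(1,3);
a = zeros(4,1);
=> double a[4,3]; // static size 4x3 (shorthand, not C syntax)
Limitation: Number of dimensions cannot change
a = zeros(3,3);
a = zeros(3,3,3); // not supported: number of dimensions would change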
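The shape rule for variables can be sketched in the same way: the static size of a variable is the per-dimension maximum over all assigned shapes, and an assignment with a different number of dimensions is rejected. The names MAX_DIMS, var_shape_t and merge_shape are hypothetical.

```c
#include <assert.h>

#define MAX_DIMS 3   /* assumed upper bound for this sketch */

typedef struct {
    int ndims;             /* number of dimensions */
    int size[MAX_DIMS];    /* extent per dimension */
} var_shape_t;

/* Merge the shape of one assignment into the static shape of the variable.
   Returns 1 on success, 0 if the number of dimensions differs. */
int merge_shape(var_shape_t *var, const var_shape_t *assigned)
{
    int d;
    assert(var != 0 && assigned != 0);
    if (var->ndims != assigned->ndims) return 0;
    for (d = 0; d < var->ndims; ++d) {
        if (assigned->size[d] > var->size[d]) {
            var->size[d] = assigned->size[d];   /* keep the maximum */
        }
    }
    return 1;
}
```

Merging 1x3 and 4x1 gives a static size of 4x3, and merging a 3x3x3 shape into a two-dimensional variable fails.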
Loopify
Translates Matlab/Scilab variables into
Data
Dynamic size
Static (maximum) size
Translates Matlab/Scilab statements into
Loop nest
Size calculation
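Scilab:
a = zeros(2,3);
C code:
int32_t a_data[3][2] = {{0}};        /* data, indexed [column][row] */
int32_t a_size[2];                   /* dynamic size */
const int32_t a_ssize[2] = {2, 3};   /* static (maximum) size */
int32_t v0, v1;                      /* loop counters */
for (v1 = 0; v1 < 3; ++v1) {
    for (v0 = 0; v0 < 2; ++v0) {
        a_data[v1][v0] = 0;
    }
}
a_size[0] = 2;
a_size[1] = 3;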
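For a statement that reads a variable, the loop nest and the size calculation could look like the following sketch of the Scilab statement b = a + 1. It assumes the data layout from the example above (static size 2x3, indexed [column][row]) and loops over the dynamic size. The function add_one and the macros A_SROWS and A_SCOLS are hypothetical, and the function wrapper is used only to keep the example self-contained.

```c
#include <assert.h>
#include <stdint.h>

#define A_SROWS 2   /* static (maximum) number of rows */
#define A_SCOLS 3   /* static (maximum) number of columns */

/* Scilab "b = a + 1" as a size calculation followed by a loop nest. */
void add_one(int32_t a_data[A_SCOLS][A_SROWS], const int32_t a_size[2],
             int32_t b_data[A_SCOLS][A_SROWS], int32_t b_size[2])
{
    int32_t v0, v1;
    assert(a_size[0] <= A_SROWS && a_size[1] <= A_SCOLS);
    b_size[0] = a_size[0];                   /* size calculation */
    b_size[1] = a_size[1];
    for (v1 = 0; v1 < a_size[1]; ++v1) {     /* columns */
        for (v0 = 0; v0 < a_size[0]; ++v0) { /* rows */
            b_data[v1][v0] = a_data[v1][v0] + 1;
        }
    }
}
```

All iterations of the loop nest are independent of each other, which is the property a later parallelization step can exploit.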
Simplify
Remove unnecessary “for loops”
Remove unnecessary variable dimensions
Remove size variables and statements for fixed-size variables
Scilab:
a = 1;
C code (before simplify):
int32_t a_data[1][1] = {{0}};
…
for (v1 = 0; v1 < 1; ++v1) {
    for (v0 = 0; v0 < 1; ++v0) {
        a_data[v1][v0] = 1;
    }
}
C code (after simplify):
int32_t a_data = 0;
…
a_data = 1;
Results – Lines of Code
[Figure: bar chart of lines of code (axis 0 to 5000) for the test cases SIFT, Magic, IFFT and Intracom, comparing Scilab, C (before simplify) and C (after simplify).]
Start-up company
Solutions for a parallel world
To be founded as a KIT spin-off using results from ALMA
www.emmtrix.com
Interactive Parallelization
Control parallelization by high-level decisions in GUI
Control, Traceability, Usability
Automatic test generation
Reliability
emmtrix Workflow Integration
[Figure: emmtrix workflow. Elements: development with Scilab, emmtrix parallelization solution, parallel C code, verification on a test PC and on a test platform with a multicore processor, and iteration.]
Integration into Scilab workflow
Planned Xcos integration for model-based design
Plans for emmtrix
Soon:
Release of MatrixFrontend for Scilab community
Free to use
Convert Scilab code to C code
Product launch of emmtrix Parallel Studio (not final name) at Embedded World 2016 (February 2016)
Integration into workflow
Support for different hardware platforms
Support for model-based design
Summary
ALMA Toolchain
MatrixFrontend: Convert Scilab code to C
Parallelization of generated code
Speeds up development for multi-core systems by 30-60%
emmtrix Technologies
Distribution of ALMA results
Free Scilab-to-C converter: Matrix Frontend
Interactive parallelization tool