High Level Synthesis of Algorithms with Pointers

POLITECNICO DI TORINO
Facolt`a di Ingegneria dell’Informazione
Corso di Laurea in Ingegneria Elettronica
Tesi di Laurea
Support architecture for high level
synthesis of algorithms strongly
based on pointers
Relatore:
prof. Mario Casu
Candidato:
Alessandro Renzi
Marzo 2015

Table of contents
1 Introduction 1
1.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 High Level Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.3 Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.4 Binding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.5 Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Limits of High Level Synthesis . . . . . . . . . . . . . . . . . . . . . 5
2 Case study 6
2.1 Proﬁling OpenCV . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Candidate selection . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Algorithm structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Acceleration target . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 The memory model problem . . . . . . . . . . . . . . . . . . . . . . . 12
3 The architecture 13
3.1 Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.1 Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.2 Value ﬁled . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.3 Constructors . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.4 Set methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.5 Get method . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.6 GetPtr methods . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.7 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Ram 20
4.1 Write method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Read method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
I

4.3 OutOfBound checking method . . . . . . . . . . . . . . . . . . . . . 23
4.4 Memset method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5 RingRam 25
5.1 Write method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Read method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.4 Memset method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.5 Address translation method . . . . . . . . . . . . . . . . . . . . . . . 31
5.6 StepForward method . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.7 DryStepForward method . . . . . . . . . . . . . . . . . . . . . . . . . 32
6 VirtualBuﬀer 34
6.1 Write methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.2 Read methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.4 Memset methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.5 Address translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.6 Starting address . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.7 StepForward method . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.8 DryStepForward method . . . . . . . . . . . . . . . . . . . . . . . . . 42
7 Exception Handling 43
7.1 Proposed solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
7.1.1 Exceptions encoding . . . . . . . . . . . . . . . . . . . . . . . 45
7.1.2 Exceptions throwing . . . . . . . . . . . . . . . . . . . . . . . 46
7.1.3 Exceptions retrieving . . . . . . . . . . . . . . . . . . . . . . 46
8 Integration 49
8.1 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
8.2 FIFO model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
8.2.1 isEmpty method . . . . . . . . . . . . . . . . . . . . . . . . . 52
8.2.2 isFull method . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
8.2.3 Put method . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
8.2.4 Get method . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
8.3 Helper functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.3.1 Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.3.2 Parameters container . . . . . . . . . . . . . . . . . . . . . . 55
8.3.3 Helper functions . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.4 Systematic integration procedure . . . . . . . . . . . . . . . . . . . . 62
8.4.1 The surroundings . . . . . . . . . . . . . . . . . . . . . . . . 62
II

8.4.2 The actual algorithm . . . . . . . . . . . . . . . . . . . . . . 66
8.5 SystemC module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
8.6 Functional validation . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
9 Conclusions 75
9.1 Future development . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Bibliography 79
III

List of figures
1.1 High Level Synthesis flow . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 High Level Synthesis steps . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1 Desired behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.1 Ring Ram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 RingRam overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.1 RingRam data shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.2 VirtualBuffer addressing remapped . . . . . . . . . . . . . . . . . . . 35
6.3 Complete architecture update . . . . . . . . . . . . . . . . . . . . . . 36
7.1 Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.1 System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
8.2 Block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.3 Class diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8.4 Starting image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.5 Original code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.6 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
9.1 Multiple Windows Virtual Buffer . . . . . . . . . . . . . . . . . . . . 78
IV

Chapter 1
Introduction
1.1 Intro
The exponential growth of the silicon technology allowed engineers to implement a
very large number of functionalities, with the design complexity growing exponen-
tially too.
This forced designers to develop new methodologies to handle this growing complex-
ity, for example in the beginning of the digital electronic Era based on silicon the
ICs layouts were handmade, every single transistor directly drawn. Then, when the
complexity was too high, logic synthesis paradigm was developed, defining a synthe-
sizable subset of Hardware Description Languages, which were already widely used
for simulation.
This raised the abstraction level from the circuit level to the logic function level, as
a consequence the development became faster and easier, enabling the designers to
add more complex functionalities into their projects while still being able to meet a
tight time-to-market.
Moreover this abstraction level rising (in combination with standard cell-based de-
sign) enabled an incredibly high portability and reusability value, because now the
same logic function could be integrated into different projects way more easily than
before, also allowing to experiment different technologies with the same code, for
example evaluating an ASIC implementation or an FPGA one.
Later this approach became, in turn, insufficient too, also thanks to the advent
of embedded systems, which implement a great variety of data elaboration algo-
rithms, very different from each other. Moreover also the overall system complexity
was rising very fast, so the Electronic System Level methodology (ESL) was devel-
oped along with suitable simulation languages able to support the simulation of the
1

1 – Introduction
entire system, like SystemC or SystemVerilog.
ESL methodology allowed greater capabilities in architectural exploration and val-
idation, so in order to complete the abstraction of the design flow, also High Level
Synthesis (HLS) was developed.
Figure 1.1: High Level Synthesis flow
High Level Synthesis allowed to directly translate the algorithms extracted from
the system level simulation model and implement them in hardware. Again this rise
of the abstraction level, along with methodologies like transaction-level modeling,
led to a much greater ability to handle very complex designs, greater productivity
and also greater reusability and portability of the models across different implemen-
tation technologies.
1.2 High Level Synthesis
High Level Synthesis consists of a flow of several steps [Takach(2009)], which at the
end, will produce an RTL description of the generated architecture, this architecture
will mainly be composed of a data path and a suitable control unit.
The data path is a set of functional units and registers through which the data
will flow and will be elaborated accordingly to the high level algorithm. Since the
algorithm may have branches in the program flow, the controller (essentially a finite
state machine) will have to take care of direct the data to the right functional unit
2

1 – Introduction
or register operating on multiplexers placed in the data path where needed.
The steps needed in order to produce the RTL description starting from the high
level model are the following:
• Compilation
• Allocation
• Scheduling
• Binding
• Generation
Figure 1.2: High Level Synthesis steps
3

1 – Introduction
1.2.1 Compilation
The compilation step translates the high level model into a formal representation,
usually a Control and Data Flow Graph (CDFG). A Control and Data Flow Graph
is a directed graph in which the edges represent the control ﬂow and nodes repre-
sent sequences of statements which contain no branches (basic block). This is a very
powerful representation because it allows to exhibit data and control dependencies.
An analysis of the CDFG can allow several architectural optimizations such as con-
stant folding and propagation, dead-code elimination, loop transformations and false
data dependency elimination.
1.2.2 Allocation
In the allocation step the type and the number of hardware resources are determined
in order to meet the design constraints. Some tools can choose (or can let the
user choose) to add some resources later during the scheduling or binding phases
depending on the latency and area constraints.
1.2.3 Scheduling
In this step the operations described in the high level model must be scheduled
into cycles. For this task the HLS tool need to know from the components library
the latency of each hardware resources which implement every operation. With
this information and the latency constraint the algorithm is able to schedule each
operation of the CDFG into clock cycles. If the CDFG analysis shows that there is
no data dependency between two operations, these can be scheduled in parallel if
the latency constraint requires it and the area constraint allows it.
1.2.4 Binding
The binding step is composed of two main tasks, register binding and operation
binding, in the ﬁrst each high level code variable which carries data across cycles
have to be bound to a storage unit. The algorithm can optimize register usage
binding more non-overlapping variables to the same storage unit, this means that
variables with mutually exclusive lifetimes can share the same storage unit.
Similarly the operation binding task binds scheduled operations to functional units
and if the schedule allows it, more operations can share the same functional unit.
4

1 – Introduction
1.2.5 Generation
All the preceding steps are sufficient to fully specify the architecture, so in the
generation step everything is synthesized into an RTL description which can be for
example a VHDL or Verilog code.
1.3 Limits of High Level Synthesis
Obviously there are some limits to what can be handled by a HLS tool. First of all,
the most important thing to keep in mind is the very nature of the languages which
the tools have to handle, C/C++ and SystemC are Turing-complete, so an algo-
rithm which runs on a PC would never be able to fully manage them. To be able to
do their job the tools impose that the input model uses only a not Turing-complete
subset of the original languages.
This means that infinite precision integers are not supported (but this is not a
problem since it is a feature not included in any of the listed languages), along with
recursion and dynamic memory allocation. Everything should be statically deter-
minable in order to be fully manageable by the algorithms which now can optimize
the design.
This would imply also that variable-length loops should not be supported, but since
this is a too much tight constraint and in most simple cases it is sufficient to just
break the loop (in order to make it not combinational), the tools support them but
are not able to optimize them.
Pointers are problematic too and are supported just for the most simple use cases,
in particular pointers to pointers are not supported because the tools, with a second
level of indirection, are not able anymore to follow the flow of data with a static
code analysis. This means that algorithms strongly based on pointers usage cannot
be handled by HLS tools.
But pointers are used almost everywhere and often necessary, moreover it is not
always possible to choose or suitably modify the starting algorithm to be synthe-
sized, maybe it is not even possible to be fully independent from pointers. In other
situations despite being possible it could be preferable to remain as close as possible
to the original algorithm.
So there is the need for a solution able to incorporate the pointers logic in or-
der to doesn’t require a complete change in the algorithm’s logic, while still being
manageable by HLS tools.
5

Chapter 2
Case study
In order to better explain the problem under exam and the proposed solution, the
architecture will be applied to a real world application, specifically to offload the
computation of a computer vision algorithm from OpenCV library.
2.1 Profiling OpenCV
The first step is to analyze the performances of various OpenCV algorithms in order
to search for a candidate function to accelerate.
The profiling operation is done relying on a standard linux tool called Gprof, which
is able to track the execution of a software and report some very useful statistics
about the time spent on every function.
Different kind of algorithms (taken from OpenCV official tutorials online) were an-
alyzed, the test program was almost the same every time. The frames are gathered
in real time from the camera, some pre-processing is applied, the main algorithm is
executed on the frame and then the result is shown on the screen.
The code is the following (work only with OpenCV 3):
1 #include "opencv2/opencv.hpp"
2
3 using namespace cv;
4
5 int main(int, char**)
6 {
7 VideoCapture cap(0); // open the default camera
8 if(!cap.isOpened()) // check if we succeeded
9 return -1;
10
11 Mat edges;
12 namedWindow("edges",1);
6

2 – Case study
13 for(;;)
14 {
15 Mat frame;
16 cap >> frame; // get a new frame from camera
17 cvtColor(frame, edges, COLOR_BGR2GRAY);
18 GaussianBlur(edges, edges, Size(7,7), 1.5, 1.5);
19 Canny(edges, edges, 0, 30, 3);
20 imshow("edges", edges);
21 if(waitKey(30) >= 0) break;
22 }
23
24 cap.release();
25 return 0;
26 }
The profiling must be enabled at compile time with the -g -pg flags, but there is
a problem, if the test program is compiled just with these flags, the obtained output
would be something like this (for a run-time of 30 seconds or more):
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 1.21 0.00 2536 0.00 0.00 cv::Mat::release()
0.00 1.21 0.00 2536 0.00 0.00 cv::Mat::~Mat()
0.00 1.21 0.00 1950 0.00 0.00 cv::_InputArray::init(int, void const*)
0.00 1.21 0.00 1365 0.00 0.00 cv::_InputArray::~_InputArray()
0.00 1.21 0.00 1365 0.00 0.00 cv::Size_<int>::Size_()
0.00 1.21 0.00 780 0.00 0.00 cv::_InputArray::_InputArray(cv::Mat const&)
0.00 1.21 0.00 585 0.00 0.00 cv::_InputArray::_InputArray()
0.00 1.21 0.00 585 0.00 0.00 cv::_OutputArray::_OutputArray(cv::Mat&)
0.00 1.21 0.00 585 0.00 0.00 cv::_OutputArray::~_OutputArray()
0.00 1.21 0.00 196 0.00 0.00 cv::Mat::Mat()
0.00 1.21 0.00 196 0.00 0.00 cv::String::String(char const*)
0.00 1.21 0.00 196 0.00 0.00 cv::String::~String()
0.00 1.21 0.00 196 0.00 0.00 cv::MatSize::MatSize(int*)
0.00 1.21 0.00 196 0.00 0.00 cv::MatStep::MatStep()
0.00 1.21 0.00 195 0.00 0.00 cv::Size_<int>::Size_(int, int)
0.00 1.21 0.00 1 0.00 0.00 _GLOBAL__sub_I_main
0.00 1.21 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)
This is obviously not useful at all because every function’s execution time doesn’t
add up to the total execution time. The problem is that the profiler is not able to
enter the library’s boundaries and analyze its internal operation. The solution is to
enable the profiling also on the library, operation which requires to recompile the
entire library.
In the CMakeLists.txt file there are the options needed to enable the profiling,
in particular the ENABLE PROFILING flag must be enabled, along with EN-
ABLE OMIT FRAME POINTER flag. The last one is required because otherwise
the library won’t compile. Moreover cmake must be run with the -DBUILD SHARED LIBS=OFF
option in order to build a static library instead of a shared one, this is needed for
7

2 – Case study
the profiling, with the library linked dynamically the profiler wouldn’t be able to do
its job.
The last note is that if the system on which the test program is compiled has
an hardenized kernel, gcc will automatically add the -pie flag (Position Independent
Executable), this will make conflict with the profile flag, so -nopie flag must also be
added before compiling.
Now the test is ready to be run, once the program terminates a file named gmon.out
is created, this file contains binary profiling information, the utility called gprof will
convert it into human readable format:
gprof edge gmon.out > gmon.log
The result will be something like this (depending on the total running time):
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
58.90 0.96 0.96 cv::Canny()
12.27 1.16 0.20 cv::RowVec_8u32s::operator()() const
7.98 1.29 0.13 cv::CvtColorLoop_Invoker<RGB2Gray<uchar> >::operator()() const
6.75 1.40 0.11 cvConvertImage
4.91 1.48 0.08 cv::SymmColumnFilter<FixedPtCastEx, SymmColumnVec_32s8u>::operator()
3.68 1.54 0.06 cv::SymmColumnSmallFilter<SymmColumnSmallVec_32s16s>::operator()
1.84 1.57 0.03 cv::BaseRowFilter::~BaseRowFilter()
1.23 1.59 0.02 cv::FilterEngine::proceed(uchar const*, int, int, uchar*, int)
0.61 1.60 0.01 cv::SymmRowSmallFilter<SymmRowSmallVec_8u32s>::operator()
0.61 1.61 0.01 cv::checkHardwareSupport(int)
0.61 1.62 0.01 cv::_OutputArray::create(int, int const*, int, int, bool, int)
0.61 1.63 0.01 main
0.00 1.63 0.00 2731 0.00 0.00 cv::Mat::release()
0.00 1.63 0.00 2731 0.00 0.00 cv::Mat::~Mat()
0.00 1.63 0.00 2100 0.00 0.00 cv::_InputArray::init(int, void const*)
0.00 1.63 0.00 1470 0.00 0.00 cv::_InputArray::~_InputArray()
0.00 1.63 0.00 1470 0.00 0.00 cv::Size_<int>::Size_()
0.00 1.63 0.00 840 0.00 0.00 cv::_InputArray::_InputArray(cv::Mat const&)
0.00 1.63 0.00 630 0.00 0.00 cv::_InputArray::_InputArray()
0.00 1.63 0.00 630 0.00 0.00 cv::_OutputArray::_OutputArray(cv::Mat&)
0.00 1.63 0.00 630 0.00 0.00 cv::_OutputArray::~_OutputArray()
0.00 1.63 0.00 211 0.00 0.00 cv::Mat::Mat()
0.00 1.63 0.00 211 0.00 0.00 cv::String::String(char const*)
0.00 1.63 0.00 211 0.00 0.00 cv::String::~String()
0.00 1.63 0.00 211 0.00 0.00 cv::MatSize::MatSize(int*)
0.00 1.63 0.00 211 0.00 0.00 cv::MatStep::MatStep()
0.00 1.63 0.00 210 0.00 0.00 cv::Size_<int>::Size_(int, int)
0.00 1.63 0.00 1 0.00 0.00 _GLOBAL__sub_I_main
0.00 1.63 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)
Now the call stack is complete and it is possible to identify the most time con-
suming function (in this case cv::canny()), even though some informations are still
missing. This is due to the flag ENABLE OMIT FRAME POINTER, which makes
some informations unavailable to the profiler, unfortunately there is nothing that
can be done to bypass this because without the flag the library won’t compile at all.
8

2 – Case study
Repeating the same on the target prototyping system (ZedBoard) gives a quite
similar result (the small differences are due to the much faster ram of the pc). The
following table summarizes the profiling of different kinds of algorithms:
Canny edge detection:
% time name
27.61 cv::Canny()
18.07 cv::SymmColumnSmallFilter::operator()
16.89 cv::RowFilter::operator()
10.46 CvCaptureCAM_V4L_CPP::retrieveFrame()
6.99 cv::CvtColorLoop_Invoker::operator()
5.02 cv::SymmRowSmallFilter::operator()
Corner detection:
% time name
21.87 cv::ocl_cornerMinEigenValVecs()
16.68 cv::ColumnSum::operator()
11.49 cv::RowSum::operator()
9.97 cv::RowFilter::operator()
5.18 cv::minMaxIdx_32f()
DFT:
% time name
38.67 cv::DFT_64f()
10.50 cv::dft()
9.82 cv::Log_32f()
6.97 cv::magnitude()
Face recognition:
% time name
84.72 cvRunHaarClassifierCascadeSum()
6.43 cvSetImagesForHaarClassifierCascade
2.97 cv::integral_
Motion detection:
% time name
50.97 cv::calcOpticalFlowFarneback()
24.66 cv::FarnebackUpdateMatrices()
7.52 cv::SymmColumnFilter::operator()
2.1.1 Candidate selection
The first candidate function to be accelerated was chosen after the following consid-
erations: the most profitable function is the one with the higher percentage of used
time, in this case the face recognition, but the HaarClassifierCascade algorithm re-
quires two very different inputs, this make it very complex, so it has to be excluded.
Since the foundations of the architecture are still to be placed the ideal candidate
9

2 – Case study
should be a simple function, this also exclude ones which uses floating-point arith-
metic.
The only function that corresponds to the criteria is the Canny Edge Detection al-
gorithm, which relies only on integer arithmetic and has a simple input and a simple
output.
2.2 Algorithm structure
Canny edge detection algorithm is composed of different sections: the first is the
preliminary setup, then two distinct derivatives are computed on the input frame
with the Sobel function, one in X direction and one in Y direction, these derivatives
are put into two temporary buffers called dx and dy.
After this there is the actual algorithm, it is composed of three loops. The first
is the bigger one, it scans through all the rows of the image.
Inside this big loop there are two smaller ones which scans through an entire line of
the image, the first computes the sum of the absolute values of each pixel of dx and
dy (for the current line), then puts the result in a temporary buffer called mag.
The second sub-loop does some further computation and compares the result to
predefined thresholds, if it is appropriate marks the pixel as belongings to an edge
by pushing it (its memory address) onto a stack and writing a fixed value into a
mapping buffer.
The second big loop extracts each (address of) pixel and checks if the pixels sur-
rounding it may belong to an edge, in that case marks the memory location pointed
by those addresses and push them onto the stack too. This loop goes until there are
no more addresses into the stack.
The third and last loop scans the entire mapping buffer and checks if each pixel
is marked as belonging to an edge, if it does, the correspondent pixel into the out-
put buffer is written with the value 255, which result in a white pixel.
Here a code summary in pseudo-language:
1 foreach row i
2 {
3 foreach column j
4 {
5 mag[j] = abs(dx[i][j]) + abs(dy[i][j])
6 }
7
8 foreach column j
9 {
10

2 – Case study
10 map[i][j] = thresholds(mag[j], dx[i][j], dy[i][j])
11 if(map[i][j] == edge)
12 {
13 push(address(map[i][j]))
14 }
15 }
16 }
17
18 while pixel in stack
19 {
20 pop(pixel)
21 if(surrounding(pixel) == candidate)
22 {
23 push()
24 }
25 }
26
27 foreach pixel p
28 {
29 if(map[p] == edge)
30 {
31 dest[p] = 255
32 }
33 else
34 {
35 dest[p] = 0
36 }
37 }
2.3 Acceleration target
An initial target to be offloaded could be the first big loop, since the majority of the
computation is done within it. So the hypothetical cuts could be after the derivatives
and at the end of the loop. Now inputs and outputs can be identified, the inputs
are the two derivatives of the input image, while the outputs are the mapping buffer
and the stack in which the pointers to the edges (in the mapping buffer) are saved.
Once the target is defined, the local buffers and their sizes can be identified, in
the specific these are: the two derivatives which are as big as the input image
(640x480) and each pixel takes two bytes, so 614.4KB for each derivative. The tem-
porary buffer as big as three lines plus two pixels for every line, where each pixel is
represented with a 4 bytes integer, so 7704 bytes in total.
Then the mapping buffer which is as big as the original image plus two rows and
two columns, at the beginning and at the end, like a frame around the image, the
11

2 – Case study
size of every pixel is one byte, so in total 309.444KB.
2.4 The memory model problem
The problem of this algorithm is that it is strongly based on pointers so there is not
a natural flow of data from an operation to the next, instead there is a strong corre-
lation with the local buffers. This means that every operation is performed reading
and writing the memory for each pixel, moreover a big part of the algorithm’s logic
is based on saving and working with pixels’ address instead of just pixels data.
This style of computation is exactly the opposite of what is needed for hardware
implementation and what is supported by high level synthesizers.
Making things more complex there is also the problem that the local buffers needed
by the algorithm are too big to be put into an ASIC or an FPGA (1.54MB in total).
12

Chapter 3
The architecture
The requirements for the architecture are that it must be able to store a lot of
data, so it will need to integrate a RAM, but this RAM must also be capable to
be addressed with the actual absolute address of the corresponding buﬀer on the
software side, in order to stream out meaningful addresses to be put in the stack’s
memory and so leaving the original logic of the algorithm untouched.
The other requirement is that the RAM cannot be too much big, so it must be able
to store just a portion of the data and then update it on demand according to what
is needed by the algorithm’s logic.
The wanted feature is to be able to overwrite the oldest data while linearly advancing
with the addressing, that means when the update operation takes place, the new
data must be loaded and be accessible by an address which is virtually outside the
bounds of the RAM. The following example should clarify this concept.
Figure 3.1: Desired behavior
The image represents the situation before the update and after the update, the
number in the box represents the address of the memory cell, so in this example there
is an 8 bytes RAM. At the beginning everything is like usual, the valid addresses
range from 0 to 7. When a computation cycle ﬁnishes and new data is needed, an
13

3 – The architecture
update command can be issued, after that the data result to be shifted but every
cell can be accessed with the same address as before, except for the first, which is
no more present in the memory, and the new data just acquired which now can be
accessed like it has always been there.
3.1 Locality
The last requirement identified imply also another one on the software side, that is,
in order to achieve a practical solution, the algorithm must present the properties
of temporal locality and sequential spatial locality.
An algorithm exposes the property of temporal locality if, at a certain point in time,
it accesses a given memory location, and it is very likely that in a short time frame
it will access the same memory location again. [J.(2005)]
Figure 3.2: Locality
Similarly the spatial locality property states that if the algorithm access a certain
memory location, very likely it will access also adjacent locations in the near future.
The sequential locality is a particular kind of spatial locality in which the memory
is accessed linearly. For example, considering the proposed case study, the image is
scanned one line at a time.
Putting together these properties the result is that the algorithm has to be capable
of working at any time only on a limited working set of data, which size will define
the minimum size of the RAMs associated to the local buffers.
14

To be more specific, considering also the application, this means that the algorithm
must be able, for example, to work with just few lines of the buffers in every cycle.
This properties can be verified studying a bit the algorithm’s code or more formally,
in an automated fashion, with a dynamic analysis of the code while it runs on a
sample input.
Note: Actually in the original code the mag buffer and the mapping buffer are
allocated within a single array, which is then managed through pointers.
It is not obvious at all to see it just by looking at the code, it can also seems that this
break the locality principle because while the mapping operations advance through
the image, the mag section is accessed always at the beginning.
Tracing the execution of the algorithm for one image can reveal this behavior, so
it is also possible to see that these two buffers are actually completely independent
one from the other and can be split into two distinct array, allowing to regain the
locality.
3.2 Pointers
The first thing to do is to redefine pointers in a more manageable way, so that HLS
tools can better handle them, but still can almost be a drop-in replacement for C++
pointers.
Moreover, since the software simulation very likely will be executed on a different
architecture with respect to the one which requires the offloading, for example the
target platform could be a 32-bits embedded processor coupled with an FPGA or a
custom ASIC, but the simulation could be run on an x86 64 architecture, it is better
to first define the size of the pointer as a 32 or 64 bits integer. This constant will be
called MachineAddrType, it can be also implemented in a way that it is configurable
at compile time with a compile flag.
Then the actual pointer implementation, it is represented by means of a C++ class
with a value field of type MachineAddrType. Pointers in C++ are not just plain
integers, they carry also the information about the pointed type, which will define
also the pointer arithmetic. This information must be integrated in some way into
the class, but in order to accommodate the HLS requirements, it turns out that the
best way to do this is to add a template to the class, because in this way everything
is statically determinable.
The template will also allow to define the pointer arithmetic once, in a generic way,
and let the compiler or the synthesizer to generate the required specializations.
15

The class name is AddrPtr which, of course, stands for Address Pointer, its decla-
ration is the following:
1 template<typename T>
2 class AddrPtr
3 {
4 private:
5 MachineAddrType value;
6
7 public:
8 AddrPtr();
9 AddrPtr(T *ptr);
10 AddrPtr(MachineAddrType val);
11
12 inline void set(T *ptr);
13 inline void set(MachineAddrType val);
14
15 inline MachineAddrType get() const;
16
17 inline T *getPtr() const;
18 inline int *getIntPtr() const;
19 inline int8_t *getInt8Ptr() const;
23
24 inline uint8_t *getUInt8Ptr() const;
28
29 AddrPtr operator+(int op) const;
30 AddrPtr operator-(int op) const;
31 int operator-(T *op) const;
32 int operator-(AddrPtr op) const; // Return int to support negative pointer
difference
33 void operator=(AddrPtr<T> addr);
34 void operator=(MachineAddrType addr);
35 };
3.2.1 Template
As just explained AddrPtr depends on the template parameter T, this also has the
nice property of making operations like assignment or sum between diﬀerent Ad-
drPtr specializations impossible, because template specializations are in all respects
16

different types, like with actual C++ pointer, such an operation would not make
sense.
3.2.2 Value filed
The value field is private according to OOP data hiding principle, in particular
because the operations on this field must respect the pointers arithmetic, so arbitrary
operations on the pointer address cannot be done. Should this ever be necessary,
a workaround is provided anyway by means of the set method and the assignment
operator overload which accepts a MachineAddrType value. But in any case resorting
to such workaround is implicitly a warning of wrong design and should be avoided.
(Of course the case study original code requires this, so even this bad practice will
be shown in order to demonstrate the flexibility of the solution)
3.2.3 Constructors
There are three constructors, the default constructor simply set the value field to
zero, the one which takes a MachineAddrType as argument initialize value to the
address provided by the argument. The last constructor takes a regular C++ pointer
and converts it to MachineAddrType, obviously the pointed type must be the same
of the provided template parameter. This last constructor is provided as utility for
simulation, usually it should not be needed in synthesis.
3.2.4 Set methods
These methods, as previously stated, should not be used regularly, but there can
be situations in which are useful, for example for variable reinitialization at the
beginning of a new cycle with new data just received from outside. As for the
constructor, the overload which takes a pointer as argument is just a utility for
simulation.
3.2.5 Get method
At some point the actual address value will be required in order to be used to access
the memory, the get method can be used to retrieve it in form of a MachineAddrType
value.
17

3.2.6 GetPtr methods
The getPtr methods can also be useful in simulation in order to cast the address
value to various pointer types, the implementation consists in a simple C++ rein-
terpret cast.
2 inline T *AddrPtr<T>::getPtr() const
3 {
4 return reinterpret_cast<T *> (value);
5 }
3.2.7 Operators
There are three groups of operators, one of these is composed of a sum and a sub-
traction operators which takes an integer as argument and returns an AddrPtr. The
purpose of these two operators is to mimic the C++ pointers arithmetic when an
offset is added or subtracted, the typical use case is for array access. The result is
again a pointer, so an AddrPtr initialized with the new address is returned. The new
address value is calculated according to the pointers arithmetic, so it depends on
the argument, but also on the size of template parameter type, which conceptually
is the size of the pointed type. These two numbers are multiplied and added to or
subtracted from the AddrPtr address value. This is because an offset in the pointers
context means the number of elements of the pointed type, not the number of bytes.
2 AddrPtr<T> AddrPtr<T>::operator+(int op) const
3 {
4 MachineAddrType result = value + (op * sizeof(T));
5 return AddrPtr<T>(result);
6 }
Another group of operator is composed of other two overloads of the minus
operator which computes the difference between two pointers, these operators takes
as argument a pointer (in the form of AddrPtr or C++ pointer) and return a signed
integer (not unsigned because a pointer difference can be also negative). Like before,
in the pointers context, the difference between two pointers does not represents the
bytes count between the two addresses, but instead it represent the element count (of
the pointed type) between the pointers. So the result must be computed subtracting
the addresses and dividing by the size of the pointed type.
18

For simulation everything is ok, but for synthesis this is quite a problem because
division cannot be implemented in a straightforward manner like adders, it requires
a dedicated module which is also very big in terms of occupied area and power
consumption.
A constraint have to be put, in particular it is sufficient to state that the architecture
will support only data types of size multiple of 8-bits and not bigger than 64-bits
(not too restrictive constraint after all) to be able to implement the operation very
efficiently. The stated constraint reduce the possible cases to just four simple and
very manageable cases, in fact the size of the pointed type now can be only 1, 2, 4
or 8 bytes, which means that a simple switch statement is sufficient to handle the
operation, and the division, since the divisor is fixed and always a power of two, can
be implemented as a right shift of the dividend, that implemented in hardware is
very simple and efficient.
2 int AddrPtr<T>::operator-(AddrPtr<T> op) const
3 {
4 int result = ((long long)value - (long long)op.get());
5 switch(sizeof(T))
6 {
7 case 2:
8 result = (result >> 1);
9 break;
10 case 4:
12 break;
13 case 8:
15 break;
16 default:
17 break;
18 }
19 return result;
20 }
The last group of operator includes the two assignment operators which simply
wrap the set method and allow to use it with the operator syntax.
2 void AddrPtr<T>::operator=(AddrPtr<T> addr)
3 {
4 set(addr.get());
5 }
19

Chapter 4
Ram
The ﬁrst layer of the architecture is the model of a RAM that will contain the mem-
ory element, a simple array of bytes. Since, as always, the sizes must be statically
determinable, the array length, so the RAM size, cannot be given at run-time when
the class is instantiated, it must be passed as a template parameter.
1 template <int Size>
2 class Ram
3 {
4 private:
5 uint8_t memory[Size];
6
7 bool exceptions[ram_ex_total];
8
10 bool checkOutOfBound(uint16_t address) const;
11 inline void throwException(ram_ex e);
12
13 public:
14 Ram();
15 inline void reset();
16
18 inline void write(uint16_t address, T data);
19
21 inline T read(uint16_t address, const T *retType);
22
24 inline void memset(uint16_t base_address, T data, uint16_t count);
25
26 inline int getSize() const
20

4 – Ram
27 {
28 return sizeof(memory);
29 }
30
31 inline uint8_t getException(); // Clear the flags
32 };
This layer is not just a simulation model which wrap the RAM logic, it also
implements generic writing and reading functions able to read and write data types
of arbitrary length on a 8-bits wide RAM.
This is done by means of templated methods which depend on another template
parameter, diﬀerent from the RAM’s size one. The new template parameter will
take care of generating specializations to operate with every data type that would
be necessary.
4.1 Write method
The writing method takes as argument the data and the address, which is repre-
sented with just 16 bits because it is already suﬃcient for the size that it will have
to handle.
1 template<int Size>
3 inline void Ram<Size>::write(uint16_t address, T data)
4 {
5 if(checkOutOfBound<T>(address))
6 {
7 throwException(ram_write_OutOfBound_ex);
8 return;
9 }
10
11 for(int i = 0; i < sizeof(T); ++i)
12 {
13 *(memory + address + i) = (0 | ((data >> (8*i)) & 0xff));
14 }
15 }
The method checks if the address goes out of the RAM’s bound and if it is the
case, raise an exception (the exception handling will be explained later) and return
immediately. If instead the address is correct, the memory is written in a loop one
byte at a time, in each cycle the data is suitably shifted and masked in order to
extract the right byte to be written in the right location.
21

4 – Ram
4.2 Read method
The reading method normally would require just one argument, the address to be
accessed, and return the data read from the memory. The returned type depends on
the template parameter, but this cannot be deduced by the compiler implicitly, so
another argument is needed to pass the information about the type to be returned.
Since the interest is only on the type and not on the actual data, this argument can
be just a const pointer of the templated type.
3 inline T Ram<Size>::read(uint16_t address, const T *retType)
4 {
6 {
7 throwException(ram_read_OutOfBound_ex);
8 return 0;
9 }
10
11 T data = 0;
12 for(int i = 0; i < sizeof(T); ++i)
13 {
14 data |= ( (T)(*(memory + address + i)) << (8*i) );
15 }
16
17 return data;
18 }
Before the actual reading takes place, the address is verified to be within the
bounds of the RAM’s size, if it is not, an exception is raised and the methods re-
turns immediately with the fixed value of 0.
In theory in such a case nothing should be returned, but this would require to pass
the output data in an output pointer received as an argument instead of returning
it. Moreover this argument is already in place to specify the wanted return type, it
would be just matter of removing the const modifier.
But, as already stated, HLS tools produce bad results with pointers and consid-
ering the specific case, the operation would not be supported at all, because the
data is already being retrieved through a pointer because of the memory array, so
22

4 – Ram
using a pointer also for the return type would mean a second level of indirection
which cannot be handled by HLS tools.
Once the address sanity is checked, the actual reading process begins, similarly to
the writing process, it is composed of a loop which length depends on the template’s
type size. This loop reads each byte from the memory and suitably shifts it in the
right position, then packs it (with an OR operation) into the temporary variable
which ﬁnal value will be returned.
4.3 OutOfBound checking method
The out of bound check is performed verifying not only that the starting address is
within the Ram’s size, but also making sure that the whole reading operation will
not go out of bound by checking that also the last address that will be accessed,
according to the given type size, is within the bound.
3 bool Ram<Size>::checkOutOfBound(uint16_t address) const
4 {
5 if((address + sizeof(T)) > getSize())
6 {
7 return true;
8 }
9 return false;
10 }
23

4 – Ram
4.4 Memset method
Another useful function for a memory is the memset operation, which repeatedly
write a certain value for a given number of times, starting from a given base address.
3 inline void Ram<Size>::memset(uint16_t base_address, T data, uint16_t count)
4 {
5 for(uint16_t i = 0; i < count; ++i)
6 {
7 write(base_address+i, data);
8 }
9 }
The implementation is a simple loop which write for the given amount of times
stated in the argument count. This is useful, for example, during the initialization
phases in which the buﬀer has to be cleared or preset to a certain value.
24

Chapter 5
RingRam
Once the plain RAM model is in place, another layer can be developed on top of
that. This layer will add the capability to update the data on demand. The desired
final result is that after an update operation the data is shifted, but implementing it
like an actual shift, copying each cell into the preceding one, discarding the first and
adding the new, would be impractical, incredibly inefficient and time consuming,
obviously it is not the correct way of doing it.
In fact the same result can be achieved a lot more efficiently just by remapping
the addresses in a very similar way of a ring buffer, so this layer will be called
RingRam, because when an update commend is triggered, the address mapping
rotates.
Figure 5.1: Ring Ram
The image shows what happens to the address mapping after issuing a single
update command. The lowest address (that means the oldest data) is written with
the new data, and an index, which is sufficient to keep track of the current map-
ping state, is incremented by one. This index represents the starting point of the
25

5 – RingRam
remapped addressing, in other words it always points the oldest data to be over-
written.
This means that there is a distinction between virtual addresses, which are the
ones that are passed as argument to the RingRam’s methods, and actual addresses,
which are the ones that are computed by the RingRam and are then passed the
Ram’s method calls.
2 class RingRam
3 {
4 private:
5 uint16_t index;
6 Ram<Size> ram;
7
8 bool exceptions[ringram_ex_total];
9
11 bool checkOutOfBound(uint16_t address) const;
12
13 uint16_t getActualAddress(uint16_t address) const;
14 inline void throwException(ringram_ex e);
15
16 public:
17 RingRam();
18 inline void reset();
19
21 inline void write(uint16_t address, T data);
22
24 inline T read(uint16_t address, const T *retType);
25
26 inline void memset(uint16_t base_address, uint8_t data, uint16_t count);
27
29 inline void stepForward(T data);
30
31 inline void dryStepForward(uint16_t count);
32
34 {
35 return ram.getSize();
36 }
37
39 };
26

5 – RingRam
Of course also the RingRam must depends on the Size template parameter and
propagates it to its own internal Ram instance. Instead of instantiating the Ram in-
side the RingRam, another evaluated solution was to make the RingRam class inherit
from Ram class, but since inheritance is a relation of the type ”is-a” [Prata(2011)],
it is clear that this is not the case, because RingRam is NOT a Ram, it does per-
form an address translation for it (plus other things), so this modeling feature is not
appropriate in this context and would lead to modeling incongruousness.
Because of the introduction of the concept of virtual addresses the reading and
writing method’s implementation changes quite a bit since now there is the prob-
lem of handling reading and writing operations across the physical (but not virtual)
bound of the memory. The following image should help to clarify this.
Figure 5.2: RingRam overflow
5.1 Write method
In case of multi-bytes memory operation the first thing to do now is to check if
there is overflow, that is, whether the operation should be wrapped to account for
the RingRam address rotation.
3 inline void RingRam<Size>::write(uint16_t address, T data)
4 {
5 // Check overflow
6 if((getActualAddress(address + sizeof(T) - 1) >= getActualAddress(address)))
7 {
8 // If no overflow it’s simple
9 if(checkOutOfBound<T>(getActualAddress(address)))
10 {
11 throwException(ringram_write_OutOfBound_ex);
27

5 – RingRam
12 return;
13 }
14 ram.write(getActualAddress(address), (T)data);
15 return;
16 }
17
18 uint8_t temp[sizeof(T)];
19 int i = 0;
20 for(i = 0; i < sizeof(T); ++i)
21 {
22 temp[i] = (0 | ((data >> (8*i)) & 0xff));
23 }
24
25 uint16_t reladdr = address + index;
26 uint16_t maxaddr = reladdr + sizeof(T) - 1;
27
28 // Write until the ram’s max size
29 for(i = 0; (reladdr + i) < ram.getSize(); ++i)
30 {
31 ram.write(reladdr + i, temp[i]);
32 }
33
34 // Write the remaining bytes at the beginning of the ram
35 for(int j = 0; j < (maxaddr - ram.getSize()); ++i, ++j)
36 {
37 ram.write(j, temp[i]);
38 }
39
40 return;
41 }
If the writing operation does not overflows it is possible to forward the write
call to the Ram because the operation can be handled normally as the simple Ram
would do, of course only after computing the translated actual address.
If instead the writing overflows, the wrapping have to be handled carefully, firstly
the input data is divided in bytes resorting to a temporary array, then there are
two loops, the first writes each byte of the array until the physical end of the Ram
is reached, the second continues the writing starting from the physical beginning of
the Ram until all the bytes of the temporary array are written.
28

5 – RingRam
5.2 Read method
The reading method is quite similar, it checks the overﬂow too and if there is none
the reading parameters are passed to the Ram instance method call, again after
translating the virtual address into the actual one.
3 inline T RingRam<Size>::read(uint16_t address, const T *retType)
4 {
5 // Check overflow
6 if((getActualAddress(address + sizeof(T) - 1) >= getActualAddress(address)))
7 {
8 // If no overflow it’s simple
9 if(checkOutOfBound<T>(getActualAddress(address)))
10 {
11 throwException(ringram_write_OutOfBound_ex);
12 return 0;
13 }
14 return ram.read(getActualAddress(address), retType);
15 }
16
17 uint8_t temp[sizeof(T)];
18
19 uint16_t reladdr = address + index;
20 uint16_t maxaddr = reladdr + sizeof(T) - 1;
21
22 int i = 0;
23 // Read until the ram’s max size
24 for(i = 0; (reladdr + i) < ram.getSize(); ++i)
25 {
26 temp[i] = ram.read(reladdr + i, temp);
27 }
28
29 // Read the remaining bytes at the beginning of the ram
30 for(int j = 0; j < (maxaddr - ram.getSize()); ++i, ++j)
31 {
32 temp[i] = ram.read(j, temp);
33 }
34
35 T data = 0;
36 for(i = 0; i < sizeof(T); ++i)
37 {
38 data |= ((T)temp[i] << (8*i));
39 }
40
41 return data;
29

5 – RingRam
42 }
As before, if there is overflow, the reading operation have to be wrapped. To do
this there is again the need for a temporary bytes array, which is filled by two loops.
The first reads until the physical end of the Ram, the second finishes the reading
starting from the physical beginning of the Ram.
Once the temporary array has been filled, the bytes can be packed into the final
integer which will be returned, this is done like the simple Ram would do, shifting
and adding (by means of an OR operation) to the final variable the source bytes
into a loop.
In order to check whether the virtual address given to the reading and writing
functions is allowed, that means, if it falls within the allowed range of the RingRam’s
addresses, the checkOutOfBound function compare the sum of the translated address
plus the size of the data to be read or written with the total size of the RingRam.
3 bool RingRam<Size>::checkOutOfBound(uint16_t address) const
4 {
5 if((address + sizeof(T)) > Size)
6 {
7 return true;
8 }
9 return false;
10 }
5.4 Memset method
As for the simple Ram, also for the RingRam, the memset method is just a loop
which call repeatedly the write method to write sequentially the provided data for
the given number of times.
2 inline void RingRam<Size>::memset(uint16_t base_address, uint8_t data, uint16_t
count)
30

5 – RingRam
3 {
4 for(uint16_t i = 0; i < count; ++i)
5 {
6 write(base_address+i, data);
7 }
8 }
5.5 Address translation method
The function which realize the address translation is actually very simple, the current
translation state, represented by the index variable, is added to the virtual address
passed as argument, then, in order to account for the address overﬂow and perform
the wrapping operation, the addition result is passed to the mod function which
performs the same operation as the mod operator of C++.
2 uint16_t RingRam<Size>::getActualAddress(uint16_t address) const
3 {
4 return mod<Size>(address + index);
5 }
The problem with this is that the mod operation is conceptually the reminder
of a division, but in hardware divisions are problematic and represent an obstacle,
so the mod operator of C++ (the %) cannot be used, the operation must be imple-
mented by hand in some way.
There exists a lot of mathematical methods to optimize a division, but these are
still too much complex and ineﬃcient for an hardware implementation.
Since the interest is just on the reminder of the division, the operation can be
implemented resorting on a loop which subtracts the divisor (the size of the Ram)
from the dividend (the sum of the address and the index variable) and stops only
when the result becomes lower than the divisor itself. The code is the following:
1 template<uint16_t size>
2 uint16_t mod(uint16_t n)
3 {
4 uint16_t temp = n;
5 while(temp >= size)
6 {
7 temp -= size;
8 }
31

5 – RingRam
9 return temp;
10 }
This function is used just in few very similar cases, specifically the divisor is
always known at compile time because it is always the size of a Ram, so the choice
was to pass it as a template parameter in case the synthesizer could be able to make
some kind of optimizations.
5.6 StepForward method
The update functions are two, one is called stepForward and takes as argument just
the data to be written over the oldest one, as always the data can be of whatever
size among the supported ones (but actually the implementation does support any
data size).
3 inline void RingRam<Size>::stepForward(T data)
4 {
5 write(0, data);
6 index = mod<Size>(index + sizeof(T));
7 }
The new data is written at the virtual address 0, which, as already explained
previously, because of its nature, points always at the oldest data. Then the index is
updated incrementing it by one, but as for the address translation, the incremented
value is also passed to the mod function in order to account for the overflow and
make it wrap if necessary.
5.7 DryStepForward method
The second update function is called dryStepForward and its use case is when the
buffer has to make a big step forward (many bytes at once) and there is no need
to write a specific value, so writing one byte at a time with random data just to
increment the index would be too time and power consuming.
This happens usually when a buffer is written with data only after some compu-
tation, some of this data is streamed out somewhere else, and the buffer has to
synchronize its addressing before a new computation cycle.
32

5 – RingRam
2 inline void RingRam<Size>::dryStepForward(uint16_t count)
3 {
4 index = mod<Size>(index + count);
5 }
The implementation is very simple, it’s just matter of incrementing the index by
the amount given by the only argument and wrapping if necessary by means of the
mod function.
33

Chapter 6
VirtualBuffer
Now that the RingRam layer is in place, on-demand update can be handled in a
very efficient way, but the addressing is not yet right, in fact the virtual addresses
managed by the RingRam are still relative ones, that means, does not corresponds
to actual absolute addresses of the buffer on the software side. Moreover every time
an update operation takes place, each data has a different address, the following
picture shows the concept:
Figure 6.1: RingRam data shift
This seems a downside, but is exactly why the RingRam layer exists and is ac-
tually very useful because serves the purpose of this third layer of abstraction, the
VirtualBuffer.
In order to synchronize the addressing with the software side buffer, it is obviously
34

6 – VirtualBuffer
needed to receive the initial offset through a suitable communication channel with
the cooperation of a software framework, which passes it to the communication
driver that will make it available to the hardware.
With this information is possible to further remap the addressing, shifting it to
the correct absolute starting address. To do this two variables are sufficient, one
which keeps track of the first valid address present into the VirtualBuffer, and an-
other one which keeps track of the last valid address present into the buffer.
Of course this layer will have to support the update operation too, which now
has the meaning of advancing through the addresses without being limited to the
underlying Ram size. While the RingRam addressing is limited to its size and has to
wrap to be consistent, the VirtualBuffer addressing doesn’t have to wrap anymore,
being free to advance over its nominal size limit, so being able to follow the software
side addressing but storing just the needed amount of data thanks to the RingRam
capabilities.
The following image will help to understand how addresses are remapped by the
VirtualBuffer to the RingRam:
Figure 6.2: VirtualBuffer addressing remapped
In this example each cell represents an absolute address from the higher level
of abstraction and the highlighted cell represents the RingRam’s address 0. The
VirtualBuffer translation remap the absolute address on the RingRam. At every
update operation the virtual buffer advances linearly while the RingRam wraps the
addresses when the end is reached.
In order to be more clear, it could be useful to see the whole architecture work-
ing together during two consecutive update operation showing both the addresses
and the data seen by each layer’s point of view.
35

Figure 6.3: Complete architecture update
The class interface is the following:
36

2 class VirtualBuffer
3 {
4 private:
5 MachineAddrType start;
6 MachineAddrType end; // Last valid address
7
8 RingRam<Size> ram;
9
10 bool exceptions[virtualbuffer_ex_total];
11
12 uint16_t getActualAddress(MachineAddrType address) const;
13 inline void throwException(virtualbuffer_ex e);
14
15 public:
16 VirtualBuffer();
17 inline void reset(MachineAddrType start_address = 0);
18
20 inline void write(MachineAddrType address, T data);
21
23 inline void write(AddrPtr<T> address, T data);
24
25
27 inline T read(MachineAddrType address, const T *retType);
28
30 inline T read(AddrPtr<T> address);
31
32
33 inline void memset(MachineAddrType base_address, uint8_t data, uint16_t
count);
34
36 inline void memset(AddrPtr<T> base_address, uint8_t data, uint16_t count);
37
38
40 inline void stepForward(T data);
41
42 inline void dryStepForward(uint32_t count);
43
45 bool checkOutOfBound(MachineAddrType address) const;
46
37

48 {
49 return ram.getSize();
50 }
51
52 inline MachineAddrType getStartAddress() const;
53 inline void setStartAddress(MachineAddrType start_address);
54
56 };
As usual it depends on the template parameter which defines the buffer size. In
the private section there are the two state variables start and stop which defines the
valid range of addresses for the buffer at a certain point in time, and are modified
during the update operation. There is also the instance of the associated RingRam
to which the Size template parameter is propagated.
The interface is quite similar to the RingRam’s one, with the exception for the
overloaded methods which accepts also an AddrPtr, this is now possible thanks to
the synchronization with the software side addresses.
6.1 Write methods
The write methods are very straightforward because performs the address trans-
lation and passes it to the RingRam layer which then will perform all the actions
previously explained. In the case of the AddrPtr overload it also extracts the address
from it.
3 inline void VirtualBuffer<Size>::write(MachineAddrType address, T data)
4 {
6 {
7 throwException(virtualbuffer_write_OutOfBound_ex);
8 return;
9 }
10
11 ram.write(getActualAddress(address), (T)data);
12 }
13
16 inline void VirtualBuffer<Size>::write(AddrPtr<T> address, T data)
17 {
18 write(address.get(), (T)data);
38

19 }
6.2 Read methods
The read methods are very similar and very simple too, the address is translated
and passed to the underlying layer.
The AddrPtr overload has a novelty though, that is, since the AddrPtr data type
already embed within itself (by means of the template parameter) the information
about the data type to be read, this can be avoided as an explicit argument in
the function signature, making the use of the function more natural and clean than
before.
Here is very useful the method of AddrPtr which returns a pointer to the templated
data type, because allows this overload to use the other one acting as a wrapper to
it.
3 inline T VirtualBuffer<Size>::read(MachineAddrType address, const T *retType)
4 {
6 {
7 throwException(virtualbuffer_read_OutOfBound_ex);
8 return 0;
9 }
10
11 return ram.read(getActualAddress(address), retType);
12 }
13
16 inline T VirtualBuffer<Size>::read(AddrPtr<T> address)
17 {
18 return read(address.get(), address.getPtr());
19 }
To check the address sanity the absolute address must be higher than the starting
address of the buﬀer, and the sum of the absolute address and the size of the data
being read or written do not exceed the ending address of the buﬀer.
39

3 bool VirtualBuffer<Size>::checkOutOfBound(MachineAddrType address) const
4 {
5 if(address < start)
6 {
7 return true;
8 }
9
10 if((address + sizeof(T) - 1) > end)
11 {
12 return true;
13 }
14
15 return false;
16 }
6.4 Memset methods
The memset methods are as straightforward as the write, the address is extracted
from the AddrPtr and the same address translation is performed.
2 inline void VirtualBuffer<Size>::memset(MachineAddrType base_address, uint8_t
data, uint16_t count)
3 {
4 ram.memset(getActualAddress(base_address), data, count);
5 }
6
9 inline void VirtualBuffer<Size>::memset(AddrPtr<T> base_address, uint8_t data,
uint16_t count)
10 {
11 memset(base_address.get(), data, count);
12 }
6.5 Address translation
Thanks to the behavior of the RingRam which makes its address rotate, the im-
plementation of the VirtualBuﬀer’s address translation is incredibly simple. The
buﬀer’s start address has just to be subtracted from the absolute address provided
40

as argument.
This is sufficient to make the VirtualBuffer to synchronize with RingRam’s address-
ing so that everything work as intended.
2 uint16_t VirtualBuffer<Size>::getActualAddress(MachineAddrType address) const
3 {
4 return (uint16_t)(address - start);
5 }
6.6 Starting address
The starting address of the buffer is an essential information which is useful to be
obtained from the buffer itself during the execution, so there is a method to retrieve
it.
2 inline MachineAddrType VirtualBuffer<Size>::getStartAddress() const
3 {
4 return start;
5 }
Moreover there are situations in which is useful to set it, for example during the
initial reset, or at the beginning of a new computation phase.
For these situations a method is provided instead to let the user to freely modify the
variable’s value (always in accord with the OOP data hiding principle), because every
change of the starting value must be immediately followed by (or in concurrency
with) an update of the end variable, depending on the Size value.
Not doing so can easily lead to inconsistencies in the address handling, so it is better
to constrain the writings to the variable.
2 inline void VirtualBuffer<Size>::setStartAddress(MachineAddrType start_address)
3 {
4 start = start_address;
5 end = start_address + Size - 1;
6 }
41

6.7 StepForward method
At this level, the VirtualBuffer, does not have to actually writes anything to execute
a stepForward, it simply passes the data to the underlying layer, the RingRam, which
will take care of writing at the appropriate physical address the received data.
The VirtualBuffer itself simply update its internal state to reflect the linear advance
in the absolute addressing, by incrementing the start and end by the size of the new
data being written.
3 inline void VirtualBuffer<Size>::stepForward(T data)
4 {
5 ram.stepForward(data);
6 start += sizeof(T);
7 end += sizeof(T);
8 }
6.8 DryStepForward method
As for the RingRam’s corresponding method, the dryStepForward function makes
the buffer to advance many bytes at once without writing any particular value in
the memory. First the underlying layer’s function is called passing the amount of
bytes to advance as an argument, then the VirtualBuffer’s internal state variables
are updated incrementing them by the same bytes count.
2 inline void VirtualBuffer<Size>::dryStepForward(uint32_t count)
3 {
4 ram.dryStepForward(count);
5 start += count;
6 end += count;
7 }
42

Chapter 7
Exception Handling
Especially during debug phases it is useful to being able detect runtime errors, for
this purpose C++ support exceptions, but these are not supported by synthesizers
and trying to build a custom solution allows to understand why.
Placing some global flags to set whenever there is an error is not a viable option
because global variables are not supported during synthesis (but this would be man-
ageable) and more importantly usually there are more instances of the buffer, so a
set of flags would have to be created for every instance.
The problem with this is that it does not scale well, because the code of the buffer
itself have to be modified whenever the number of instances changes, also, the so-
lution is not self-contained, that means, it is not possible to bundle it within the
buffer’s code.
Another solution could be to insert some public flags to be read in order to check
for exceptions during the execution.
The problem with this solution is that the layered structure of the architecture hides
the complexity of the lower levels, so only the flags of the higher level would be ac-
cessible.
A function which attach to a custom structure an integer code which represents the
exception and passes it to the upper layer could solve the problem, but this cannot
be done statically, in order to implement this solution a pointer would have to be
used in order to create and to pass this structure.
But dynamic memory allocation is not supported in synthesis so every implementa-
tion that pass a structure populated with the error codes cannot be used.
Another possibility could be to copy the flags in a custom structure and return it
by value, but this would require every class to know the structure’s implementation
of the underlying layers.
43

7 – Exception Handling
7.1 Proposed solution
A possible way to dynamically attach an information to the one coming from a lower
level and passes it to the upper level is to represent the exceptions with flags but
pack them into an integer, then implement a function to retrieve the exceptions that
gets the flags of the lower levels and then attach the flags of the current level by
shifting left the old ones and inserting the new ones, as showed by the image.
Figure 7.1: Exceptions
This solution has also the nice properties of being self-contained, so being easy
to bundle with the classes and being very easy to handle algorithmically so scaling
smoothly as other exceptions are implemented without the need of modifying the
code (exceptions as integer codes instead of an array of booleans wouldn’t have
allowed it).
The only limitation is that number of exceptions is limited by the size of the integer
44

which is passed, but since usually synthesizers are able to tailor the effective bit-
width of the signals to the right size (if everything is statically determinable), a
bigger integer can simply be used in the code, leaving the synthesizer to optimize it.
7.1.1 Exceptions encoding
The choice is to implement the exception flags as a boolean array, every layer has
its own array with the flags having their own meanings. In order to formally encode
these meanings, to both assign a label and make more easy to add other exceptions
when needed, the C++ construct enumeration is used.
1 enum ram_ex
2 {
3 ram_write_OutOfBound_ex = 0,
4 ram_read_OutOfBound_ex,
5 ram_ex_total
6 };
7
8 enum ringram_ex
9 {
10 ringram_write_OutOfBound_ex = 0,
11 ringram_read_OutOfBound_ex,
12 ringram_ex_total
13 };
14
15 enum virtualbuffer_ex
16 {
17 virtualbuffer_write_OutOfBound_ex = 0,
18 virtualbuffer_read_OutOfBound_ex,
19 virtualbuffer_ex_total
20 };
21
22 enum fifo_ex
23 {
24 fifo_put_full_ex = 0,
25 fifo_get_empty_ex,
26 fifo_ex_total
27 };
One enumeration per layer is created, into each enumeration there are the ex-
ception labels. Forcing the first to begin with the number 0 (otherwise the starting
number is undefined and could change with different implementations) it is possible
to directly use the labels to address the exceptions inside the flag arrays.
The last label instead, always represents the total number of labels inside the enu-
meration, and so the total number of flags of the layer, because of this the last label
45

is used to declare the size of the flag arrays. This simplifies a lot the implementa-
tion of new exceptions because it is simply matter of adding another label in the
penultimate position (the last is always reserved for the total number of labels).
Actually there isn’t any particular problem changing the labels order (except the
last of course), but only if also the software which reads the error code uses the same
header file to handle the labels meaning.
7.1.2 Exceptions throwing
Thanks to the formal label implementation the exception throwing becomes ex-
tremely easy:
2 inline void VirtualBuffer<Size>::throwException(virtualbuffer_ex e)
3 {
4 exceptions[e] = true;
5 }
When an exception is caught by the layer’s code, it calls the throwException
function passing as an argument the corresponding exception label, which actually
is encoded as a number that is used to address the flags array of the layer in order
to set the correspondent flag to true.
The good thing of having implemented different enumerations, one for every layer,
is that the compiler is also able to statically check if the label used belongs to the
right enumeration, because the function’s argument make the enumeration explicit.
7.1.3 Exceptions retrieving
The code for getting the exceptions can be divided into two distinct groups: the
lowest layer, and the other layer.
The code of the lowest layer is the following:
2 uint8_t Ram<Size>::getException()
3 {
4 uint8_t temp = 0;
5 if(exceptions[0])
6 {
7 temp |= 1;
8 exceptions[0] = false;
9 }
10
46

11 for(int i = 1; i < ram_ex_total; ++i)
12 {
13 temp = temp << 1;
14 if(exceptions[i])
15 {
16 temp |= 1;
17 exceptions[i] = false;
18 }
19 }
20 return temp;
21 }
The code of the first layer is different because it must initialize the integer (to 0),
then the first flag is inserted separately in the first if clause, then a loop implement
the generic logic which can scale to any number of exceptions thanks to the last
enumeration’s label, which defines how long the loop must last.
This loop takes the temporary variable and every cycle shift it left by one bit, then
if the flag is set, the least significant bit is also set to 1, and the flag is cleared, in
order to be immediately able to catch another exception, the user could forget to
do it so loosing other errors.
Once the loop finishes the variable is returned to the caller with all the appropriate
flags set.
The code for the other layers instead is this one:
2 inline uint8_t VirtualBuffer<Size>::getException()
3 {
4 uint8_t temp = ram.getException();
5 for(int i = 0; i < virtualbuffer_ex_total; ++i)
6 {
7 temp = temp << 1;
8 if(exceptions[i])
9 {
10 temp |= 1;
11 exceptions[i] = false;
12 }
13 }
14 return temp;
15 }
First the exceptions from the underlying layers are retrieved calling the getEx-
ception method, in this way a chain is formed, from the lowest level up to the higher
47

called by the user. The exceptions are saved into a temporary variable.
Then the loop’s logic is the same, the variable is shifted and if the correspondent
flag is set, also the lsb of the variable is set, then the flag is cleared.
Once the loop finishes the temporary variable is returned either to the upper layer
or the user itself.
This modeling style with loops seems to be inefficient with respect to what would
be implemented in HDL (hardware description language), because implementing a
loop is way more complex and area consuming than a simple wire arrangement that
would be sufficient in HDL.
But since everything is statically determinable, the synthesizer is able to handle and
optimize the loop, specifically it is able to perform a full loop unrolling operation
which allows a strong optimization that should give a near-handwritten quality of
result.
In this way the loop is just a way to handle any number of exceptions algorith-
mically, that means, without having to manually add an assignment for every new
exception implemented.
48

Chapter 8
Integration
Now that the architecture is complete it can be integrated into the algorithm, but
before being able to do this, one thing is still missing: a system level description of
the data input and output mechanism.
The choice falls on the most standard way of exchanging data with other hardware
modules, the FIFO.
A FIFO allows the independence on the source of data, that now can be another
computation module, as well as a communication bus that receives data from an AXI
bus, or any other kind of bus. This makes also easy, if needed, to put the module
in a different clock domain in order to, for instance, trading off performances and
power consumption.
8.1 System overview
Recalling the case study, the inputs are the two derivatives and the outputs are the
data buffer and the stack’s data. But in the previous chapters, other inputs and
outputs were defined, specifically: an input for the parameters of the function call
and an output for the runtime exceptions signaling.
The input parameters include the arguments of the Canny algorithm and more
importantly the base absolute address of the data buffer.
So the final specifications are the following:
• Inputs
– Parameters
– Dx
– Dy
49

8 – Integration
• Outputs
– Buﬀer
– Stack
– Exceptions
This means that in order to be able to run a functional simulation, also a model
of the FIFO must be developed.
The general system scheme is summarized into the following image.
Figure 8.1: System
50

8 – Integration
8.2 FIFO model
Although the FIFO is used only in simulation its model has still to be developed, for
this the same style of modeling will be used in order to maintain the code consistency.
1 template <typename DataType, int ElemCount>
2 class Fifo
3 {
4 private:
5 Ram<((ElemCount+1) * sizeof(DataType))> ram;
6 int in;
7 int out;
8
9 public:
10 Fifo();
11
12 DataType get();
13 void put(DataType data);
14
15 bool isEmpty() const;
16 bool isFull() const;
17 };
The template parameters are two, the first defines the data type which will be
handled by the FIFO, the second defines the maximum number of elements (of type
DataType) that can be stored into the FIFO at the same time.
As underlying storage the Ram model will be used. The storage for the internal way
of working of the FIFO has to be able to store one element more than the nominal
size of the FIFO itself.
That one more element is used to distinguish the empty condition from the full
condition by keeping it always empty.
The class interface supports the standard and well known FIFO functions:
• Get()
• Put()
• isEmpty()
• isFull()
51

8 – Integration
8.2.1 isEmpty method
Among the possible different types of implementation, the chosen one is to define
the empty condition as the equality of the two state variables (in and out). This
choice will consequently define also the rest of the implementation.
2 bool Fifo<DataType, ElemCount>::isEmpty() const
3 {
4 if(in == out)
5 {
6 return true;
7 }
8 return false;
9 }
8.2.2 isFull method
Since the empty condition is identified with the equality of the state variables, for the
full condition, another way must be used to detect it. This could be to check whether
the in variable is equal to the always-empty element, that is the one preceding the
element pointed by the out state variable.
2 bool Fifo<DataType, ElemCount>::isFull() const
3 {
4 if( in == ((out - sizeof(DataType) + ram.getSize()) % ram.getSize()) )
5 {
6 return true;
7 }
8 return false;
9 }
8.2.3 Put method
Before putting something in the FIFO, it must be checked if there is enough space
into the underlying storage, if this is the case, the data is written into the Ram at
the address defined by the in variable.
Once the data is written, the in variable must be updated to point to the next free
location, by incrementing the value by the size of the DataType and if necessary
wrapping it thanks to the mod operator, as shown in the code.
52

8 – Integration
2 void Fifo<DataType, ElemCount>::put(DataType data)
3 {
4 if(isFull())
5 {
6 return;
7 }
8
9 ram.write(in, data);
10
11 in = (in + sizeof(DataType)) % ram.getSize();
12 }
8.2.4 Get method
In a very similar, but dual, way the get method checks if the FIFO is empty, if it is
then the function returns the conventional value 0. If the FIFO is not empty, the
data is retrieved from the storage at the address pointed by the out variable and
saved into a temporary variable, because before returning it the out variable must
be updated.
This is done incrementing the variable by the size of the DataType and wrapping if
necessary. Once this operation is completed, the data can be returned.
2 DataType Fifo<DataType, ElemCount>::get()
3 {
4 if(isEmpty())
5 {
6 return 0;
7 }
8
9 DataType temp;
10 temp = ram.read(out, &temp);
11
12 out = (out + sizeof(DataType)) % ram.getSize();
13
14 return temp;
15 }
53

8 – Integration
8.3 Helper functions
Now all the models are implemented, the following image shows a block diagram of
the framework’s layers.
Figure 8.2: Block diagram
But before starting to integrate it into the algorithm, other helper functions can
be implemented in order to keep everything more organized.
8.3.1 Constants
First a class encapsulating all the needed constants can be useful:
1 class CannyConst
2 {
3 public:
4 static const unsigned int cols = 640;
5 static const unsigned int rows = 480;
6
7 static const unsigned int size_parameters = sizeof(CannyParameters);
8 static const unsigned int size_dx = (640 * 480 * 2);
9 static const unsigned int size_dy = (640 * 480 * 2);
10 static const unsigned int size_magbuffer = ((640+2)*3*sizeof(int));
11 static const unsigned int size_buffer = ((640+2)*(480+2));
12 static const unsigned int size_stack = 50000;
13
14 static const unsigned int step_size_dx = (640 * 2);
15 static const unsigned int step_size_dy = (640 * 2);
16 static const unsigned int step_size_buffer = (640);
17
18 static const unsigned int step_mul_dx = 3;
54

8 – Integration
19 static const unsigned int step_mul_dy = 3;
20 static const unsigned int step_mul_buffer = 4;
21
22 static const unsigned int vb_size_dx = (step_size_dx * step_mul_dx);
23 static const unsigned int vb_size_dy = (step_size_dy * step_mul_dy);
24 static const unsigned int vb_size_magbuffer = size_magbuffer;
25 static const unsigned int vb_size_buffer = ((step_size_buffer+2) *
step_mul_buffer);
26 };
The first two constants represent the sizes of the image, the second group instead
represents the total size of the buffers, this group is more used in simulation than in
synthesis, except for the first constant of the group which will be explained shortly.
The third group defines the size of one row in bytes, that once multiplied by the
fourth group, which defines the number of rows for each buffer, gives the last group
of constants that represent the sizes of each hardware buffer.
8.3.2 Parameters container
Then a custom structure can be implemented to encapsulate all the algorithm’s
parameters:
1 struct CannyParameters
2 {
3 int32_t low;
4 int32_t high;
5 uint32_t L2gradient;
6 uint32_t mapstep;
7 MachineAddrType base_address;
8 };
This structure will be instantiated and populated in the software domain, then
transmitted through a FIFO to the hardware domain, where there will be another
instance in which every field will be written reading from the FIFO.
8.3.3 Helper functions
The custom structure can be used both in simulation and in synthesis and can be
also coupled to a function which takes care of writing it to the FIFO:
55

8 – Integration
1 void write_out(Fifo<uint32_t, sizeof(CannyParameters)> *fifo, const
CannyParameters& buffer)
2 {
3 fifo->put(buffer.low);
4 fifo->put(buffer.high);
5 fifo->put(buffer.L2gradient);
6 fifo->put(buffer.mapstep);
7 fifo->put(buffer.base_address);
8 }
And also to another function which gets the parameters from the FIFO and
populates the hardware counterpart of the structure:
1 void populateCannyParameters(CannyParameters *par, Fifo<uint32_t,
sizeof(CannyParameters)> *fifo)
2 {
3 par->low = fifo->get();
4 par->high = fifo->get();
5 par->L2gradient = fifo->get();
6 par->mapstep = fifo->get();
7 par->base_address= fifo->get();
8 }
A similar function can be implemented for the derivatives to write them in the
FIFOs:
1 void write_out(Fifo<uint32_t, CannyConst::size_dx> *fifo, const Mat& mat)
2 {
3 int channels = mat.channels();
4 int nRows = mat.rows;
5 int nCols = mat.cols * channels * 2; // dx and dy has size short
6
7 if (mat.isContinuous())
8 {
9 nCols *= nRows;
10 for(int i = 0; i < nCols; i += sizeof(uint32_t))
11 {
12 fifo->put( *((uint32_t *)(mat.data+i)) );
13 }
14 }
15 else
16 {
17 for(int j = 0; j < nRows; ++j)
18 {
56

8 – Integration
19 const uint32_t *p = mat.ptr<uint32_t>(j);
20 for(int i = 0; i < (nCols/4); ++i)
21 {
22 fifo->put(p[i]);
23 }
24 }
25 }
26 }
This function takes an OpenCV Mat object and put it into the FIFO in chunks
of 32-bits at a time (common size for the communication bus). Since the data inside
the Mat object can also be not continuous (but actually it is never the case), the
two cases are handled separately.
The same can be done for importing the data from the output FIFOs back to the
software buffers for the next sections of the algorithm that will be executed by the
CPU.
One function for the data buffer which is a simple loop that gets data from the FIFO
and writes it to the software buffer’s memory:
1 void read_in(Fifo<uint32_t, CannyConst::size_buffer> *fifo, uchar *buf, unsigned
int size)
2 {
3 for(unsigned int i = 0; i < size; i += 4)
4 {
5 uint32_t temp = fifo->get();
6 *((uint32_t *)(buf + i)) = temp;
7 }
8 }
And another for the Stack, which is a bit different, because while for the others
the length was well known, for this it is not, because the number of pixel’s addresses
pushed onto the Stack depends on the actual number of edges present into the input
image, so it cannot be known in advance.
This means that the loop which reads from the FIFO must rely on the empty signal
to understand when there is no more data. But this is valid only in the simulation,
where first all data will be pushed, and only after that it will be read from the FIFO.
During the actual running on hardware, another solution must be used to signal the
algorithm’s end, because everything is concurrent and the FIFO can get empty also
when the computation is not finished.
57

8 – Integration
1 inline unsigned int read_in(Fifo<uint32_t, CannyConst::size_stack> *fifo,
std::vector<uchar*>& stack, unsigned int size)
2 {
3 unsigned int i = 0;
4 for(i = 0; i < size; i += 1)
5 {
6 if(fifo->isEmpty())
7 {
8 return i;
9 }
10 else
11 {
12 uint32_t temp = fifo->get();
13 *((uint32_t *)(&stack[i])) = temp;
14 }
15 }
16 return i;
17 }
Once the reading ends, the elements count is returned to correctly set the
stack top pointer.
Bus assumption: From now on, the assumption on the bus functioning will be
that there are separate channels, one for every FIFO, which receives data from a
correspondent device file located into the /dev directory.
Under this assumption (and assuming suitable functions to handle it have been
developed overloading the functions already presented) it is possible to further or-
ganize things, a communication helper class can be implemented.
This class defines high level methods for writing and reading data of the inputs
and outputs of the system, allowing to use a simple flag to state if the class should
redirect the writings and readings to FIFOs or device files, minimizing in this way
the code differences between the simulation and the actual implementation.
The class interface is the following:
1 class CannyHandler
2 {
3 public:
4 CannyHandler(bool use_fifo = false);
5 ~CannyHandler();
6
58

8 – Integration
7 void writeParameters(const CannyParameters& buffer);
8 void writeDx(const Mat& mat);
9 void writeDy(const Mat& mat);
10
11 void readBuffer(uchar *buf);
12 unsigned int readStack(std::vector<uchar*>& stack);
13
14 private:
15 bool fifo_flag;
16
17 static ofstream file_parameters;
18 static ofstream file_dx;
19 static ofstream file_dy;
20 static ifstream file_buffer;
21 static ifstream file_stack;
22
23 Fifo<uint32_t, CannyConst::size_parameters> *fifo_out_parameters;
24 Fifo<uint32_t, CannyConst::size_dx> *fifo_out_dx;
25 Fifo<uint32_t, CannyConst::size_dy> *fifo_out_dy;
26
27 Fifo<uint32_t, CannyConst::size_buffer> *fifo_in_buffer;
28 Fifo<uint32_t, CannyConst::size_stack> *fifo_in_stack;
29 };
The important thing of this class is the fifo flag variable which will tell to every
function how to manage the operations, it must be correctly set when the class is
instantiated.
The constructor, depending on the flag argument, will instance the needed FIFOs
or not, if the choice is to open the files, this cannot be done in the constructor, it
must be done statically into the .cpp file.
1 CannyHandler::CannyHandler(bool use_fifo)
2 {
3 if(use_fifo)
4 {
5 fifo_flag = true;
6
7 fifo_out_parameters = new Fifo<uint32_t, CannyConst::size_parameters>();
8 fifo_out_dx = new Fifo<uint32_t, CannyConst::size_dx>();
9 fifo_out_dy = new Fifo<uint32_t, CannyConst::size_dy>();
10
11 fifo_in_buffer = new Fifo<uint32_t, CannyConst::size_buffer>();
12 fifo_in_stack = new Fifo<uint32_t, CannyConst::size_stack>();
13 }
14 else
15 {
59

8 – Integration
16 fifo_flag = false;
17 }
18 }
Then every function, depending on the internal flag, chose the right overload of
the helper functions and pass as arguments the buffers and the right FIFO or file
stream.
1 void CannyHandler::writeParameters(const CannyParameters& buffer)
2 {
3 if(fifo_flag)
4 {
5 write_out(fifo_out_parameters, buffer);
6 }
7 else
8 {
9 write_out(file_parameters, buffer);
10 }
11 }
12
13 void CannyHandler::writeDx(const Mat& mat)
14 {
15 if(fifo_flag)
16 {
17 write_out(fifo_out_dx, mat);
18 }
19 else
20 {
21 write_out_D(file_dx, mat);
22 }
23 }
24
25 void CannyHandler::writeDy(const Mat& mat)
26 {
27 if(fifo_flag)
28 {
29 write_out(fifo_out_dy, mat);
30 }
31 else
32 {
33 write_out_D(file_dy, mat);
34 }
35 }
36
37
38 void CannyHandler::readBuffer(uchar *buf)
39 {
60

8 – Integration
40 if(fifo_flag)
41 {
42 read_in(fifo_in_buffer, buf, CannyConst::size_buffer);
43 }
44 else
45 {
46 read_in(file_buffer, buf, CannyConst::size_buffer);
47 }
48 }
49
50 unsigned int CannyHandler::readStack(std::vector<uchar*>& stack)
51 {
52 if(fifo_flag)
53 {
54 return read_in(fifo_in_stack, stack, CannyConst::size_stack);
55 }
56 else
57 {
58 return read_in(file_stack, stack);
59 }
60 }
61

8 – Integration
8.4 Systematic integration procedure
The framework is now complete and is summarized into this UML class diagram:
Figure 8.3: Class diagram
8.4.1 The surroundings
The first thing that has to be done is to define two cuts into the original code. This
had already been done in the chapter about the case study in order to define the
inputs and outputs, these cuts are before and after the selected big loop.
First cut
Before the cut these actions has to be performed:
• Move the declarations of the software correspondent of the output buffers
above the first cut and all the algorithm’s variables below.
• Perform all the preliminary operations like the two derivatives.
• Create and populate the parameters structure and instantiate the helper class.
62

8 – Integration
• Write all input buﬀers into the FIFOs.
The declaration of the buﬀers and the preliminary operations remain basically
untouched, are only to be moved above the cut:
1 const int type = _src.type(), depth = CV_MAT_DEPTH(type), cn = CV_MAT_CN(type);
2 const Size size = _src.size();
3
4 CV_Assert( depth == CV_8U );
5 dst.create(size, CV_8U);
6
7 if (!L2gradient && (aperture_size & CV_CANNY_L2_GRADIENT) ==
CV_CANNY_L2_GRADIENT)
8 {
9 // backward compatibility
10 aperture_size &= ~CV_CANNY_L2_GRADIENT;
11 L2gradient = true;
12 }
13
14 if ((aperture_size & 1) == 0 || (aperture_size != -1 && (aperture_size < 3 ||
aperture_size > 7)))
15 CV_Error(CV_StsBadFlag, "");
16
17 if (low_thresh > high_thresh)
18 std::swap(low_thresh, high_thresh);
19
20 Mat src = _src.getMat(), dst = _dst.getMat();
21
22 Mat dx(src.rows, src.cols, CV_16SC(cn));
23 Mat dy(src.rows, src.cols, CV_16SC(cn));
24
25 Sobel(src, dx, CV_16S, 1, 0, aperture_size, 1, 0, BORDER_REPLICATE);
26 Sobel(src, dy, CV_16S, 0, 1, aperture_size, 1, 0, BORDER_REPLICATE);
27 if (L2gradient)
28 {
29 low_thresh = std::min(32767.0, low_thresh);
30 high_thresh = std::min(32767.0, high_thresh);
31
32 if (low_thresh > 0) low_thresh *= low_thresh;
33 if (high_thresh > 0) high_thresh *= high_thresh;
34 }
35 int low = cvFloor(low_thresh);
36 int high = cvFloor(high_thresh);
37
38 CV_Assert( cn == 1 );
39 MachineAddrType mapstep = src.cols + 2;
40 uchar buffer[((src.cols+2)*(src.rows+2) + mapstep * 3 * sizeof(int))];
41
63

8 – Integration
42 int maxsize = CannyConst::size_stack;
43 std::vector<uchar*> stack(maxsize);
44 uchar **stack_top = &stack[0];
45 uchar **stack_bottom = &stack[0];
The parameters structure is declared and populated with all the needed param-
eters taken from the arguments or the preliminary computed values, in particular
the base address of the data buffer is copied.
Then also the CannyHandler helper class is instantiated enabling the flag for using
the FIFOs.
1 CannyParameters sw_par = CannyParameters();
2 sw_par.L2gradient = L2gradient;
3 sw_par.mapstep = mapstep;
4 sw_par.base_address = reinterpret_cast<MachineAddrType> (buffer);
5
6 CannyHandler handler = CannyHandler(true);
The last step is to write out on the FIFOs the input buffers, thanks to the helper
class methods this is now very simple and the code is very clean. Moreover the code
for the actual implementation can be exactly the same, the only change required is
to switch the argument in the handler’s constructor in order to use the device files.
1 handler.writeParameters(sw_par);
2 handler.writeDx(dx);
3 handler.writeDy(dy);
After the cut these other actions has to be done to initialize everything in the
hardware domain:
• The virtual buffers has to be instantiated along with the hardware counterpart
of the of the parameters structure.
• Import the parameters from the bus channel and populate the structure.
• Set the starting address of the virtual buffers using the parameters data.
• Fill the input buffers reading the data from the respective bus channels.
The first three steps are quite easy thanks to the function developed previously:
64

8 – Integration
1 VirtualBuffer<MachineAddrType, CannyConst::vb_size_dx> vb_dx =
VirtualBuffer<uint32_t, CannyConst::vb_size_dx>();
2 VirtualBuffer<MachineAddrType, CannyConst::vb_size_dy> vb_dy =
VirtualBuffer<uint32_t, CannyConst::vb_size_dy>();
3 VirtualBuffer<MachineAddrType, CannyConst::vb_size_magbuffer> vb_magbuffer =
VirtualBuffer<uint32_t, CannyConst::vb_size_magbuffer>();
4 VirtualBuffer<MachineAddrType, CannyConst::vb_size_buffer> vb_buffer =
VirtualBuffer<uint32_t, CannyConst::vb_size_buffer>();
5
6 CannyParameters hw_par = CannyParameters();
7 populateCannyParameters(&hw_par, handler.getFifoParameters());
8
9 vb_magbuffer.setStartAddress(hw_par.base_address);
10 vb_buffer.setStartAddress(CannyConst::size_magbuffer+hw_par.base_address);
The last step requires two loops which read from the right FIFO and fill the
respective buffers (keep in mind that the VirtualBuffers are just a portion of the total
size of the software correspondent, all these sizes are defined into the CannyConst
class).
1 for(MachineAddrType step_i = 0; step_i < CannyConst::vb_size_dx; step_i +=
sizeof(uint32_t))
2 {
3 vb_dx.write(step_i, handler.getFifoDx()->get());
4 }
5
6 for(MachineAddrType step_i = 0; step_i < CannyConst::vb_size_dy; step_i +=
sizeof(uint32_t))
7 {
8 vb_dy.write(step_i, handler.getFifoDy()->get());
9 }
Second cut
Before the cut everything from the output buffers has to be streamed out if not
already done:
Specifically the remaining data into the data buffer has to be finally flushed. Since
the Stack’s data is directly sent to the FIFO whenever an edge is found, there is
nothing remained to be streamed out.
1 for(MachineAddrType step_i = vb_buffer.getStartAddress(); step_i <
(hw_par.base_address+CannyConst::size_buffer+CannyConst::vb_size_magbuffer);
step_i += sizeof(uint32_t))
65

8 – Integration
2 {
3 handler.getFifoBuffer()->put(vb_buffer.read(step_i, &p));
4 }
After the cut the output buﬀers has to be read and stored into their correspond-
ing software counterparts:
1 handler.readBuffer((uchar *)buffer + CannyConst::size_magbuffer);
2 unsigned int stack_size = handler.readStack(stack);
3 stack_top = &stack[0] + stack_size;
Also, the last information missing has to be handled, the Stack’s size must be
used to correctly set the top pointer.
8.4.2 The actual algorithm
Pointers
Now that the input and output sections are completed, the core algorithm can be
managed. The required actions are not complex, the fundamental step is to convert
all the pointers to the AddrPtr type, for example in the original code there are these
declarations:
1 int* mag_buf[3];
2 mag_buf[0] = (int*)(uchar*)buffer;
3 mag_buf[1] = mag_buf[0] + mapstep;
4 mag_buf[2] = mag_buf[1] + mapstep;
This is an array of pointers and is fundamental for the algorithm’s functioning,
but such a structure is absolutely not supported by the HLS tools, because imply
a second level of indirection. Converting the pointer type to the AddrPtr type this
becomes then manageable because internally to the AddrPtr class, the address is
not represented as a pointer, instead it is represented as a simple integer, so it is
allowed to create arrays of the class. The converted declaration is this:
1 AddrPtr<int>mag_buf[3];
2 mag_buf[0] = AddrPtr<int>(hw_par.base_address);
3 mag_buf[1] = mag_buf[0] + hw_par.mapstep;
4 mag_buf[2] = mag_buf[1] + hw_par.mapstep;
66

8 – Integration
Thanks to the redefinition of the assignment operator in the AddrPtr class the
code conversion is straightforward and requires just to read the parameters from the
custom structure’s instance.
The same must be done for every pointer in the code, following there are some
examples extracted from the original code where some more complex declarations
are done:
1 uchar* map = (uchar*)(mag_buf[2] + mapstep*cn);
2
3 int* _norm = mag_buf[(i > 0) + 1] + 1;
4
5 short* _dx = dx.ptr<short>(i);
6
7 uchar* _map = map + mapstep*i + 1;
8
9 int* _mag = mag_buf[1] + 1;
And here the correspondent conversion for each one of them.
1 AddrPtr<uint8_t> map = AddrPtr<uint8_t>((mag_buf[2] + hw_par.mapstep));
2
3 AddrPtr<int> _norm = AddrPtr<int>((mag_buf[(i > 0) + 1] + 1));
4
5 AddrPtr<short> _dx = AddrPtr<short>((cols * sizeof(short) * i));
6
7 AddrPtr<uint8_t> _map = AddrPtr<uint8_t>(map + (hw_par.mapstep*i + 1));
8
9 AddrPtr<int> _mag = AddrPtr<int>((mag_buf[1] + 1).get());
Also the redefinition of the subtraction operator for pointers difference proves to
be useful, the original code:
1 ptrdiff_t magstep1 = mag_buf[2] - mag_buf[1];
2 ptrdiff_t magstep2 = mag_buf[0] - mag_buf[1];
Becomes after the conversion:
1 int magstep1 = mag_buf[2] - mag_buf[1];
2 int magstep2 = mag_buf[0] - mag_buf[1];
67

High Level Synthesis of Algorithms with Pointers

High Level Synthesis of Algorithms with Pointers

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Viewers also liked

Viewers also liked (20)

Similar to High Level Synthesis of Algorithms with Pointers

Similar to High Level Synthesis of Algorithms with Pointers (20)

High Level Synthesis of Algorithms with Pointers