March 26, 2016
Real-time applications on Intel Xeon/Phi
Karel Ha
CERN High Throughput Computing collaboration
Summary:
The Intel Xeon/Phi platform is a powerful x86 multi-core engine with a very high-speed
memory interface. In its next version it will be able to operate as a stand-alone system with a
very high-speed interconnect. This makes it a very interesting candidate for (near) real-time
applications such as event-building, event-sorting and event preparation for subsequent
processing by high level trigger software algorithms.
Abstract
The following document is a report providing the first results on the performance of In-
tel Xeon Phi computing accelerator in the context of LHCb Online Data Acquisition system
(DAQ).
The main focus is put into the event-sorting task: when data arrive from different sources
corresponding to different parts of the LHCb detector, they are grouped by the source,
from which they originate. In the next stage of DAQ, it is necessary to make a decision,
whether to store the given collision event or not. For this purpose, it is more convenient to
group the data by their memberships to collision events (i.e. all data from one collision need
to be placed together), so that the DAQ system can decide based on the “whole picture” of
one event.
The Xeon Phi is an interesting candidate for event-sorting task. It offers a large number
of cores and vast amount of memory. Furthermore, this task can also be very well paral-
lelized, which can make it especially suitable for the many-core architecture of the Xeon
Phi. Thus, this report may be used to study feasibility of the Intel Xeon Phi platform for the
next upgrade of the LHCb detector in 2018-2019.
Contents
1 Introduction
1.1 Description of the problem
1.2 Algorithm
1.3 The goal
2 Offload-bandwidth
3 Prefix-offset
4 Event-sort
4.1 The distribution of iteration durations
4.2 Comparison between event-sort and raw memcpy
4.3 Blockschemes for memcpy
4.4 ASLR on KNC and its effect on event-sort
4.5 Fixation of input data
4.6 Varying of number of copy-threads
5 Some ideas for future work
6 Conclusion
Appendix A Infrastructure
Appendix B Compilers
Appendix C Reproducing the event-sort results
C.1 Source code and setup
C.2 Offload-bandwidth
C.3 Prefix-offset
C.4 Event-sort
1 INTRODUCTION
Intel Xeon Phi, or Intel Many Integrated Core Architecture (MIC), is a promising x86 many-core
computing accelerator. As such, it is suitable for highly parallelizable jobs such as
event-sorting, a subtask of the LHCb Data Acquisition System (DAQ). In this report, we present our
measurements of event-sorting on an Intel Xeon Phi card, specifically the “Knights Corner” (KNC) version.
There are 3 demo programs:
• offload-bandwidth
• prefix-offset
• event-sort
The first two programs serve as preliminary tools for baseline benchmarks and for testing the
properties of the Xeon Phi, whereas the last one simulates the real conditions of event-sorting in
the LHCb DAQ.
For details on the software and hardware used, consult Appendices A and B. The instructions for
reproducing the results are in Appendix C.
There is also a shared CERNBox folder htcc_shared, which contains all the logs that I kept
regularly during my internship. For full details (source code, bash and gnuplot scripts, figures,
raw output files, results etc.), request access to the shared folder and consult my logs.
1.1 DESCRIPTION OF THE PROBLEM
The LHCb detector at CERN is a complex instrument consisting of many subdetectors. Hence,
there are also many (approximately 1000) sources of input channels for the DAQ system. Each
of the readout boards keeps the fragments of information (so-called MEP fragments, or
mep_contents in the source code) in its own buffer. The fragments come from different
channels and different collisions. The number of collisions per buffer is called the MEP factor
(by default 10000 fragments per source).
For further processing, however, it is much more favorable to re-arrange (transpose) the
fragments and group them together according to the collision they belong to:
FIGURE 1: TRANSPOSE OF FRAGMENTS
For better illustration, see the example below:
−−−−−−−−−−Input MEP contents−−−−−−−−−−
Source #0 111222333334444
Source #1 555566667777788888
Source #2 9999aaaaabbbcc
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
−−−−−−−−−−Output MEP contents−−−−−−−−−
Collision #0 11155559999
Collision #1 2226666aaaaa
Collision #2 3333377777bbb
Collision #3 444488888cc
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
In the “Input MEP contents”, source #0 stores 3 bytes from collision #0 (labeled by character
“1”), 3 bytes from collision #1 (labeled by character “2”), 5 bytes from collision #2 (labeled by
character “3”) and 4 bytes from collision #3 (labeled by character “4”).
Source #1 (corresponding to a different subdetector) stores 4 bytes from collision #0 (la-
beled by character “5”) followed by the data from the collisions #1 to #3. Source #2 stores 4
bytes also from collision #0 (labeled by character “9”) and likewise for the remaining collisions.
At this point, the transposition re-shuffles the data so that all the information from one col-
lision is placed together. Therefore, in the “Output MEP contents”, buffer for collision #0 con-
tains the previously mentioned 3 bytes from source #0 (labeled by character “1”), 4 bytes from
source #1 (labeled by character “5”) and 4 bytes from source #2 (labeled by character “9”).
Here is another example of the transposition:
−−−−−−−−−−Input MEP contents−−−−−−−−−−
Source #0 11111222333334444
Source #1 5566667777788888
Source #2 99aaaaabbbcc
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
−−−−−−−−−−Output MEP contents−−−−−−−−−
Collision #0 111115599
Collision #1 2226666aaaaa
Collision #2 3333377777bbb
Collision #3 444488888cc
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
The lengths of MEP fragments (usually 80 to 120 bytes per fragment) are represented as
16-bit integers and are stored in a separate array. This improves performance: several length
values fit into one cache line, so we can read and process several fragment lengths with one
cache load.
The buffers for MEP fragments are stored in an array of arrays. There is one array mep_contents[i]
for each source #i. A contiguous block of memory is allocated for every such buffer mep_contents[i].
However, two consecutive buffers do not necessarily lie in one contiguous block of memory.
The output array is saved in one contiguous block of memory. It stores the “re-shuffled”
copies of fragments, now grouped by collisions into collision blocks. Furthermore, the collision
blocks are concatenated according to the collision index. For instance, the second example above
would produce this output array:
111115599 2226666aaaaa 3333377777bbb 444488888cc
The spaces were added for clarity, in order to separate different collisions.
1.2 ALGORITHM
In order to copy the data for transposition (for each fragment of each source), two types of
array offsets (represented as 32bit integers) need to be computed:
• read_offsets[] is the array of offsets determining where to copy from. It is the number of
bytes from the beginning of mep_contents[i] where source i is the source corresponding
to the fragment.
• write_offsets[] is the array of offsets determining where to copy to. It is the number of
bytes from the beginning of the output array.
Offsets are computed by applying a prefix sum to the appropriate elements of the array of lengths.
The prefix sum is the following problem: given an array of numbers a[], produce an array s[]
of the same size, where s[0] = 0 and s[i] = a[0] + a[1] + ... + a[i − 1] for i > 0. The prefix-sum
problem is the core part of event-sorting.
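The definition above can be illustrated with a minimal serial sketch. The function name exclusive_prefix_sum is hypothetical and is not taken from the xphi-lhcb sources; the 16-bit lengths and 32-bit offsets follow the types mentioned in the text.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Exclusive prefix sum: s[0] = 0 and s[i] = a[0] + ... + a[i-1] for i > 0.
// The function name is an assumption made for this illustration only.
std::vector<std::uint32_t> exclusive_prefix_sum(const std::vector<std::uint16_t>& a) {
    std::vector<std::uint32_t> s(a.size());
    std::uint32_t running = 0;  // sum of all elements seen so far
    for (std::size_t i = 0; i < a.size(); ++i) {
        s[i] = running;         // offset of element i = sum of elements before it
        running += a[i];
    }
    return s;
}
```

For the lengths 3, 3, 5, 4 of source #0 in the first example, the offsets come out as 0, 3, 6, 11, which are exactly the positions where the fragments labeled “1”, “2”, “3” and “4” start in that buffer.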
Since the prefix sum for read_offsets[] within one source buffer is independent of the
computations in other source buffers, we may parallelize it using #pragma omp parallel for.
Similarly, the prefix sum for write_offsets[] can also be parallelized using #pragma omp parallel for
(for details, see the function get_write_offsets_OMP_version() in prefix-sum.cpp).
After the read_offsets and write_offsets are computed, the content of each fragment can
be copied using the standard memcpy() function. For MEP fragments, this copy-task is inde-
pendent of one another, and hence, can be run in parallel. Namely, #pragma omp parallel for
has been used to parallelize the loop. This loop iterates over all MEP fragments and performs
the memcopies.
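The whole procedure (offsets, then parallel copies) can be sketched as follows. This is only an illustration under stated assumptions, not the code of the repository: the function name transpose_fragments, the use of std::string as a buffer type and the flat index j * n_sources + i are all hypothetical, and write_offsets is computed serially here for brevity, whereas the report parallelizes it. Without OpenMP support the pragmas are ignored and the code runs serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: lengths[i][j] is the length of the fragment that
// source i holds for collision j; mep_contents[i] is the buffer of source i.
std::string transpose_fragments(const std::vector<std::string>& mep_contents,
                                const std::vector<std::vector<std::uint16_t>>& lengths) {
    const long n_sources = static_cast<long>(mep_contents.size());
    const long mep_factor = n_sources ? static_cast<long>(lengths[0].size()) : 0;
    const long n_fragments = n_sources * mep_factor;

    // Fragment (source i, collision j) lives at flat index j * n_sources + i.
    std::vector<std::uint32_t> read_offsets(n_fragments), write_offsets(n_fragments);

    // read_offsets: one independent prefix sum per source buffer.
    #pragma omp parallel for
    for (long i = 0; i < n_sources; ++i) {
        std::uint32_t running = 0;
        for (long j = 0; j < mep_factor; ++j) {
            read_offsets[j * n_sources + i] = running;
            running += lengths[i][j];
        }
    }

    // write_offsets: prefix sum over the fragments in collision-major order
    // (serial in this sketch).
    std::uint32_t total = 0;
    for (long j = 0; j < mep_factor; ++j) {
        for (long i = 0; i < n_sources; ++i) {
            write_offsets[j * n_sources + i] = total;
            total += lengths[i][j];
        }
    }

    // The copies are independent of each other, so the loop can run in parallel.
    std::string output(total, '\0');
    #pragma omp parallel for
    for (long f = 0; f < n_fragments; ++f) {
        const long i = f % n_sources;
        const long j = f / n_sources;
        std::memcpy(&output[write_offsets[f]],
                    mep_contents[i].data() + read_offsets[f],
                    lengths[i][j]);
    }
    return output;
}
```

Called with the three source buffers of the first example in Section 1.1 and the lengths {3,3,5,4}, {4,4,5,5} and {4,5,3,2}, the sketch returns the four collision blocks of that example concatenated into one array.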
1.3 THE GOAL
The goal of the demos is to test the speed and the feasibility of the Xeon Phi for event-sorting.
Possible performance improvements are studied, namely various parallelization techniques.
2 OFFLOAD-BANDWIDTH
This program measures the bandwidth between the host and the device using the #pragma offload
directive...
a) offloading only to the device:
$ make && ./offload-bandwidth.exe -i 20 -e 1500000000
icpc -lrt main.cpp -o offload-bandwidth.exe
Using MIC0...
Transferred: 30 GB
Total time: 4.37726 secs
Bandwidth: 6.8536 GBps
b) offloading to the device and copying the result back:
$ make && ./offload-bandwidth.exe -i 20 -e 1500000000
icpc -lrt main.cpp -o offload-bandwidth.exe
Using MIC0...
Transferred: 60 GB
Total time: 8.67822 secs
Bandwidth: 6.91386 GBps
This bandwidth roughly corresponds to the speed of the 50 Gbit/s PCIe interface between the host
and the device. Here, the host machine is lhcb-phi.cern.ch (see Appendix A). The speed remains
the same even when offload-bandwidth is launched on all 4 Xeon Phi cards at the same time
(as 4 concurrent processes). This means there are four 50 Gbit/s PCIe interfaces and each of
them can be fully saturated during offloads.
For more details, consult the README at
https://github.com/mathemage/xphi-lhcb/tree/master/src/offload-bandwidth#parallel-run-on-all-available-mics
3 PREFIX-OFFSET
This program implements and tests the speed of prefix sum calculation.
a) 1000 iterations for the array size of 40000000, short int numbers a[i] range from 0 to 100:
Total time : 521.639 secs
Processed : 7.66814e+07 elements per second
b) 100000 iterations for the array size of 40000000, short int numbers a[i] range from 0 to 65534:
Total elements : 6000000000
Total time : 77.8086 secs
Processed : 7.71123e+07 elements per second
This is the result from 1 KNC card with lhcb-phi.cern.ch as the host (see Appendix A).
For more details, see the README at
https://github.com/mathemage/xphi-lhcb/tree/master/src/prefix-offset#output
4 EVENT-SORT
LHCb Online owns 4 Intel Xeon Phi “KNC” cards. They are available on the lhcb-phi.cern.ch
machine (see Appendix A).
4.1 THE DISTRIBUTION OF ITERATION DURATIONS
The simulation is iterated many times to average out statistical fluctuations. The number of
iterations is controlled via the command-line argument -i.
a) The results for 200 iterations:
# ./event-sort.mic.exe -i 200
...
−−−−−−−−−−SUMMARY−−−−−−−−−−
Total elements : 2e+09
Time for computing read_offsets : 0.553636 secs
Time for computing write_offsets : 2.50423 secs
Time for copying : 17.4631 secs
Total time : 20.521 secs
Total size : 230.013 GB
Processed : 9.74612e+07 elements per second
Throughput : 11.2087 GBps
−−−−−−−−−−−−−−−−−−−−−−−−−−−
Time for computing read_offsets is the total time spent calculating prefix sums for read_offsets[],
time for computing write_offsets is the total time spent calculating prefix sums for write_offsets[]
and time for copying is the total time of performing memcpy() of MEP fragments.
b) The results and the histogram for 1000 iterations:
−−−−−−−−STATISTICS OF TIME INTERVALS (in secs)−−−−−−−−−−−−
The initial iteration : 0.43506
min : 0.10139
max : 0.10303
mean : 0.10216
. . .
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
−−−−−−−−STATISTICS OF THROUGHPUTS (in GBps)−−−−−−−−−−−−−−−
min : 11.16119
max : 11.34263
mean : 11.25702
. . .
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
−−−−−−−−−−SUMMARY−−−−−−−−−−
Total elements : 1e+10
Time for computing read_offsets : 3.14013 secs
Time for computing write_offsets : 12.2161 secs
Time for copying : 86.8014 secs
Total time : 102.158 secs
Total size : 1149.98 GB
Processed : 9.7888e+07 elements per second
Throughput : 11.2569 GBps
−−−−−−−−−−−−−−−−−−−−−−−−−−−
The histograms of the previous measurements:
4.2 COMPARISON BETWEEN EVENT-SORT AND RAW MEMCPY
The program memcpy-bandwidth tests only the throughput of the memcpy() function on the
Intel Xeon Phi. It copies chunks (arrays) of data from one place to another (with OpenMP pa-
rallelization). This process is iterated (50 times in the case below) and the final throughput is
calculated.
The number of threads is varied using #pragma omp parallel for num_threads(). The corre-
sponding plot is in Figure 2.
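A raw memcpy() benchmark of this kind can be sketched as below. This is a hedged illustration, not the memcpy-bandwidth program itself: the function name, its parameters and the equal-sized chunk layout are assumptions, and the throughput is reported in units of 10^9 bytes per second. The destination buffer must be at least as large as the source buffer.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch: copies src to dst chunk by chunk, repeats the copy
// `iterations` times and returns the measured throughput in GB/s.
double memcpy_throughput_GBps(const std::vector<char>& src, std::vector<char>& dst,
                              std::size_t chunk_size, int iterations, int nthreads) {
    const long n_chunks = static_cast<long>(src.size() / chunk_size);
    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it) {
        // Each chunk is copied independently; nthreads limits the team size.
        #pragma omp parallel for num_threads(nthreads)
        for (long c = 0; c < n_chunks; ++c) {
            const std::size_t offset = static_cast<std::size_t>(c) * chunk_size;
            std::memcpy(dst.data() + offset, src.data() + offset, chunk_size);
        }
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    const double bytes = static_cast<double>(n_chunks) * chunk_size * iterations;
    return bytes / elapsed.count() / 1e9;
}
```

Calling the function in a loop over nthreads yields one throughput value per thread count, which is the kind of data a plot of throughput against the number of threads is built from.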
FIGURE 2: EVENT-SORT COMPARED TO RAW MEMCPY(), WITH VARIABLE NUMBER OF
THREADS
4.3 BLOCKSCHEMES FOR MEMCPY
The memory access patterns for event-sort can be optimized by splitting the workload into
blocks or blockschemes of fragments. The serial version of event-sort would process frag-
ments as shown in Figure 3. Each circle represents one MEP fragment, indexed by its source
and its event.
FIGURE 3: WITHOUT A BLOCKSCHEME
The previously mentioned parallelized event-sort would assign each circle to a single
worker-thread. Since the sizes of fragments are typically 80-120 B, such a memcpy is inefficient:
the core caches are much larger than one fragment and thus not fully used.
By assigning the whole block of workload to every worker-thread, we reduce cache thrash-
ing. There are 4 blocks of 2x2 size in the blockscheme of Figure 4, which would be processed
by 4 worker-threads in parallel.
FIGURE 4: 2X2 BLOCKS
Moreover, the spatial locality of data can also play an important role: fragments in the rows of
the picture are stored in a contiguous block of memory. Thus, the blocks load from and store
into only contiguous parts of memory.
The algorithm is given the block dimensions (on the picture: 2 sources per each block, 2
events per each block). The blocks are then distributed among worker-threads (by OpenMP
parallel for loop). Within every block, each assigned worker performs a memcpy using pre-
viously computed read_offsets [] and write_offsets [] .
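The block-wise distribution of work can be sketched as follows. The sketch is an assumption-laden illustration rather than the repository's copy_MEPs_block_scheme(): the function name copy_with_blockscheme, the std::string buffers, the serial offset computation and the traversal order inside a block are all hypothetical. Both block dimensions must be greater than zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical sketch: lengths[i][j] is the length of the fragment of
// source i for event j; one loop iteration copies one whole block.
std::string copy_with_blockscheme(const std::vector<std::string>& mep_contents,
                                  const std::vector<std::vector<std::uint16_t>>& lengths,
                                  std::size_t sources_per_block,
                                  std::size_t events_per_block) {
    const std::size_t n_sources = mep_contents.size();
    const std::size_t n_events = n_sources ? lengths[0].size() : 0;

    // Offsets are computed serially here to keep the sketch short.
    std::vector<std::vector<std::uint32_t>> read_offsets(
        n_sources, std::vector<std::uint32_t>(n_events));
    std::vector<std::vector<std::uint32_t>> write_offsets = read_offsets;
    for (std::size_t i = 0; i < n_sources; ++i) {
        std::uint32_t running = 0;
        for (std::size_t j = 0; j < n_events; ++j) {
            read_offsets[i][j] = running;
            running += lengths[i][j];
        }
    }
    std::uint32_t total = 0;
    for (std::size_t j = 0; j < n_events; ++j) {
        for (std::size_t i = 0; i < n_sources; ++i) {
            write_offsets[i][j] = total;
            total += lengths[i][j];
        }
    }
    std::string output(total, '\0');

    // Block counts are rounded up so that partial blocks at the edges
    // are still processed.
    const std::size_t source_blocks = (n_sources + sources_per_block - 1) / sources_per_block;
    const std::size_t event_blocks = (n_events + events_per_block - 1) / events_per_block;
    const long n_blocks = static_cast<long>(source_blocks * event_blocks);

    #pragma omp parallel for
    for (long b = 0; b < n_blocks; ++b) {
        const std::size_t block = static_cast<std::size_t>(b);
        const std::size_t i_begin = (block % source_blocks) * sources_per_block;
        const std::size_t j_begin = (block / source_blocks) * events_per_block;
        const std::size_t i_end = std::min(i_begin + sources_per_block, n_sources);
        const std::size_t j_end = std::min(j_begin + events_per_block, n_events);
        for (std::size_t i = i_begin; i < i_end; ++i) {
            for (std::size_t j = j_begin; j < j_end; ++j) {
                std::memcpy(&output[write_offsets[i][j]],
                            mep_contents[i].data() + read_offsets[i][j],
                            lengths[i][j]);
            }
        }
    }
    return output;
}
```

The result does not depend on the block dimensions; only the memory access pattern changes, which is what the benchmarks over different dimensions explore.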
In order to find out optimal block dimensions, a series of benchmark tests has been carried
out. The results are represented in the following heatmap:
FIGURE 5: EVENT-SORT WITH VARIOUS PARAMETERS OF BLOCKSCHEME (KNC)
The event-sort with optimal block dimensions (according to the heatmap on the right side):
# ./upload-to-MIC.sh -i 100 -1 5 -2 28
...
____________________________SUMMARY____________________________
Total elements : 1e+09
Time for computing read_offsets : 0.28435 secs
Time for computing write_offsets : 1.13954 secs
Time for copying : 3.1574 secs
Total time : 4.58129 secs
Total size : 114.998 GB
Processed : 2.18279e+08 elements per second
Throughput : 25.1016 GBps
_______________________________________________________________
Comparing the times, about 69 % of all the time is spent doing memcopies. The rest is the
computation of offsets. Moreover, the overall throughput has been improved by a factor of > 2
(compare with Section 4.1).
4.4 ASLR ON KNC AND ITS EFFECT ON EVENT-SORT
Address Space Layout Randomization (ASLR) was suspected to cause great inconsistency in
results on KNL Xeon Phi. This was pointed out by Wim Heirman. This is the e-mail conversation
with him:
Hi Karel,
I did some more runs, now with Linux address randomization turned on (my
machine had it disabled previously). I do see some large variations now. Do you
have address randomization turned on for your machine? (see output of "sysctl
kernel.randomize_va_space", 0 means disabled while 1 and 2 enable different
parts of it). Can you do a few more runs with a disabled setting? (See [1], I
think the setarch -R option should work even if you don't have root access).
Regards,
Wim
[1] http://stackoverflow.com/questions/11238457/disable-and-re-enable-address-space-layout-randomization-only-for-myself
I have tried my application on KNCs with various settings of ASLR. There were 100 experi-
ments (runs), each performed only 1 iteration.
For kernel.randomize_va_space = 0:
mean = 20.0434 min = 19.6947 max = 20.4567 standard deviation = 0.1267
For kernel.randomize_va_space = 1:
mean = 20.3565 min = 19.5846 max = 21.1473 standard deviation = 0.3669
For kernel.randomize_va_space = 2:
mean = 20.305 min = 19.555 max = 21.1037 standard deviation = 0.3641
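In conclusion, ASLR does seem to affect the variation: with ASLR enabled (settings 1 and 2), the
standard deviation is roughly three times that of the disabled setting (about 0.37 versus 0.13).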
4.5 FIXATION OF INPUT DATA
Rainer and I had a hypothesis that the throughput of event-sort may be highly dependent on
the input data size (i.e. on whether the lengths fit cache lines). In order to test this idea, I
have implemented the option --srand-seed. It sets a custom seed for the srand() function, which is
used for randomizing the input data. Hence, by initializing to a chosen custom seed, the input
will always be the same between different runs.
For the range of seeds from 0 to 100, I have studied the variability (mean, standard
deviation, min, max) of the resulting throughputs. The screenshot of the results can be found in
Figure 6. The mean, the (sample-based) standard deviation, the min and the max are always taken
from 10 runs. Each run initializes srand() to the same seed (the one given in the first column).
Blue and red cells are the min and max, respectively, of the values in the corresponding column.
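For reference, the four statistics can be computed as in the sketch below. The struct RunStats and the function summarize are hypothetical names; the standard deviation is the sample-based one (divisor n − 1), so at least two measurements are required.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the statistics of one group of runs.
struct RunStats {
    double mean;
    double stddev;  // sample-based: divisor is n - 1
    double min;
    double max;
};

// Requires x.size() >= 2 (the sample standard deviation is undefined otherwise).
RunStats summarize(const std::vector<double>& x) {
    const std::size_t n = x.size();
    double sum = 0.0, lo = x[0], hi = x[0];
    for (double v : x) {
        sum += v;
        lo = std::min(lo, v);
        hi = std::max(hi, v);
    }
    const double mean = sum / static_cast<double>(n);
    double squared_deviations = 0.0;
    for (double v : x) {
        squared_deviations += (v - mean) * (v - mean);
    }
    const double stddev = std::sqrt(squared_deviations / static_cast<double>(n - 1));
    return RunStats{mean, stddev, lo, hi};
}
```

Applied to the 10 throughput values measured for one seed, this gives one row of statistics of the kind described above.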
For comparison, here is an entirely serial version (i.e. copy_MEPs_serial_version()) with the
two chosen seeds:
• srand-seed == 83:
mean = 0.111149 standard deviation = 4.47532e-05 min = 0.111081 max = 0.111204
mean = 0.111167 standard deviation = 8.98804e-05 min = 0.111082 max = 0.111397
mean = 0.11108 standard deviation = 0.000120816 min = 0.110984 max = 0.111401
• srand-seed == 89:
mean = 0.111119 standard deviation = 5.10757e-05 min = 0.11104 max = 0.111186
mean = 0.111151 standard deviation = 5.33504e-05 min = 0.111079 max = 0.111227
mean = 0.111093 standard deviation = 0.000144087 min = 0.110992 max = 0.111487
There was no OpenMP for the copying part, but there are still two OpenMP-parallel functions
for the computation part. That is why the standard deviation is not exactly 0.
The conclusion is: even with fixed input data, the deviation is small but not (almost) 0. This
suggests that the variation has another cause, possibly the non-determinism of thread scheduling.
FIGURE 6: EVENT-SORT (IN GBYTES/S) ON KNC FOR VARIOUS ASLR AND VARIOUS FIXATED
INPUT DATA (DEPENDENT ON THE SEED)
4.6 VARYING OF NUMBER OF COPY-THREADS
Another idea is to fix the input data and vary the number of threads that perform the
copying part. This is done by OpenMP here:
void copy_MEPs_block_scheme() {
...
#pragma omp parallel for num_threads(nthreads)
...
}
Figure 7 shows the dependency of (sample-based) standard deviation on the number of
copying threads. The deviation is taken out of 10 experiments (runs). The tested numbers of
copy-threads are 1, 2, 4, 8, 16, 32 and 64.
Figure 8 shows the identical experiment for all numbers of copy-threads from 1 to 64.
From the latter figure, it seems there is no apparent dependency between number of copy-
threads and standard deviation of runs.
FIGURE 7: NUMBER OF (COPYING) THREADS VS. STANDARD DEVIATION OF RESULTS
(1, 2, 4, 8, 16, 32, 64 THREADS)
FIGURE 8: NUMBER OF (COPYING) THREADS VS. STANDARD DEVIATION OF RESULTS
(1, 2, 3, 4, · · · , 64 THREADS)
5 SOME IDEAS FOR FUTURE WORK
• “Recompile” the event-sort project using the ispc compiler: https://ispc.github.io/. This
compiler has promising auto-vectorization capabilities.
• Write unit tests for the project. For instance, using the Google Test framework:
https://github.com/google/googletest
• Use CMake instead of hand-written Makefiles: https://cmake.org/
• Consider (try, test and benchmark) usage of Intel TBB for the prefix-sum functions:
https://www.threadingbuildingblocks.org/
• Consider (try, test and benchmark) usage of OpenCL for the prefix-sum functions:
https://www.khronos.org/opencl/
• Run high_performance_linpack_benchmark on Xeon Phi:
https://lbdokuwiki.cern.ch/doku.php?id=upgrade:high_performance_linpack_benchmark
• Participate in CERN Concurrency Forum: http://concurrency.web.cern.ch/
6 CONCLUSION
The simulations of the event-sorting task show that KNC is capable of delivering a throughput
of about 25 GB/s. Our aim was to reach about 12.5 GB/s, so as to saturate the 100 Gbit/s Ethernet
network, which is one of the candidate networks for the LHCb upgrade.
This has been accomplished by splitting the workload into blocks of fragments and letting
the threads memcopy whole blocks of fragments rather than doing it fragment by fragment.
The excess throughput can be exploited as additional computing power. For example, some
portion of the Xeon Phi cards (cores, number of threads) can be allocated for event-sorting (just
enough for 12.5 GB/s), whereas the remaining capacity may be used for other algorithms, so as
to start the reconstruction process already in this very early stage. Thus, the overall quality of
decisions whether to store or discard the events would improve.
A INFRASTRUCTURE
The LHCb Online group provides the server machine lhcb-phi.cern.ch. This host machine contains
32 logical Intel(R) Xeon(R) 2.00GHz processors:
[kha@lhcb-phi kha]$ less /proc/cpuinfo | tail -n 26
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping : 7
microcode : 1808
cpu MHz : 1200.000
cache size : 20480 KB
physical id : 1
siblings : 16
core id : 7
cpu cores : 8
apicid : 47
initial apicid : 47
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat
epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 4014.16
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management :
with the operating system:
[kha@lhcb-phi kha]$ uname -a
Linux lhcb-phi 2.6.32-504.el6.x86_64 #1 SMP Tue Oct 14 11:22:00 CDT 2014 x86_64 x86_64 x86_64 GNU/Linux
On top of that, there are also 4 Intel KNC Xeon Phi cards (so-called “devices”, here MIC0,
MIC1, MIC2 and MIC3). They are connected to the host via 50 Gbit/s PCIe lanes and each of
them has 228 logical processors:
[xeonphi@lhcb-phi-mic0 ~]$ less /proc/cpuinfo | tail -n 26
processor : 31
vendor_id : GenuineIntel
cpu family : 11
model : 1
model name : 0b/01
stepping : 3
cpu MHz : 1100.000
cache size : 512 KB
physical id : 0
siblings : 228
core id : 56
cpu cores : 57
apicid : 227
initial apicid : 227
fpu : yes
fpu_exception : yes
cpuid level : 4
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm
bogomips : 2205.22
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management :
each with the operating system:
[kha@lhcb-phi kha]$ uname -a
Linux lhcb-phi 2.6.32-504.el6.x86_64 #1 SMP Tue Oct 14 11:22:00 CDT 2014 x86_64 x86_64 x86_64 GNU/Linux
B COMPILERS
The source code is written in C++ and uses OpenMP for task-based parallelization. It requires
the Intel compiler:
[kha@lhcb-phi event-sort]$ icpc -V
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 20150407
Copyright (C) 1985-2015 Intel Corporation. All rights reserved.
or Intel’s version of gcc compiler for cross-compilation on Xeon Phi:
[kha@lhcb-phi event-sort]$ /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++ -v
Using built-in specs.
COLLECT_GCC=/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++
COLLECT_LTO_WRAPPER=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/libexec/k1om-mpss-linux/gcc/k1om-mpss-linux/4.7.0/lto-wrapper
Target: k1om-mpss-linux
Configured with: /sandbox/build/tmp/tmp/work/x86_64-nativesdk-mpsssdk-linux/gcc-cross-canadian-k1om-4.7.0+mpss3.5.1-1/gcc-4.7.0+mpss3.5.1/configure
--build=x86_64-linux --host=x86_64-mpsssdk-linux
--target=k1om-mpss-linux --prefix=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr
--exec_prefix=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr
--bindir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux
--sbindir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux
--libexecdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/libexec/k1om-mpss-linux
--datadir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share
--sysconfdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/etc
--sharedstatedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/com
--localstatedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/var
--libdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/lib/k1om-mpss-linux
--includedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/include
--oldincludedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/include
--infodir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share/info
--mandir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share/man --disable-silent-rules
--disable-dependency-tracking
--with-libtool-sysroot=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux
--with-gnu-ld --enable-shared --enable-languages=c,c++ --enable-threads=posix --disable-multilib
--enable-c99 --enable-long-long --enable-symvers=gnu --enable-libstdcxx-pch
--program-prefix=k1om-mpss-linux- --enable-target-optspace --enable-lto --enable-libssp
--disable-bootstrap --disable-libgomp
--disable-libmudflap --with-system-zlib --with-linker-hash-style=gnu --enable-cheaders=c_global
--with-local-prefix=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux/usr
--with-gxx-include-dir=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux/usr/include/c++
--with-build-time-tools=/sandbox/build/tmp/tmp/sysroots/x86_64-linux/usr/k1om-mpss-linux/bin
--with-sysroot=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux
--with-build-sysroot=/sandbox/build/tmp/tmp/sysroots/knightscorner --disable-libunwind-exceptions
--disable-libssp
--disable-libgomp --disable-libmudflap
--with-mpfr=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux
--with-mpc=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --enable-nls
--enable-__cxa_atexit
Thread model: posix
gcc version 4.7.0 20110509 (experimental) (GCC)
C REPRODUCING THE EVENT-SORT RESULTS
C.1 SOURCE CODE AND SETUP
The source code is available on GitHub: https://github.com/mathemage/xphi-lhcb
$ git clone git@github.com:mathemage/xphi-lhcb.git
Cloning into 'xphi-lhcb'...
...
$ cd xphi-lhcb/
Then source the CERN setup script for Intel tools:
source /afs/cern.ch/sw/IntelSoftware/linux/all-setup.sh
To enable OpenMP, find the libiomp5.so file:
$ find /afs/cern.ch/sw/IntelSoftware/linux/x86_64/ -name libiomp5.so
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/cce/10.1.008/lib/libiomp5.so
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/Compiler/11.1/059/lib/ia32/libiomp5.so
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/Compiler/11.1/059/lib/intel64/libiomp5.so
...
...and copy it into the xphi−lhcb/lib/ folder.
Note: the instructions below were done and are valid for the commit:
commit ae7bc6ff540fbbdc0c1b09382f5e821e0c40e6dc
Author: Karel Ha <mathemage@gmail.com>
Date: Thu Oct 8 13:17:58 2015 +0200
Change location of libiomp5.so
(The output produced by later versions of the repository may differ.)
C.2 OFFLOAD-BANDWIDTH
Change to the directory xphi-lhcb/src/offload-bandwidth/ and launch the program once for
each MIC card (i.e. 4 processes in our case):
[kha@lhcb-phi offload-bandwidth]$ ./run-on-all-MICs.sh
icpc -lrt main.cpp -o offload-bandwidth.exe
Launching offload-bandwith on MIC 0...
Launching offload-bandwith on MIC 1...
Launching offload-bandwith on MIC 2...
Launching offload-bandwith on MIC 3...
After a while, when all processes finish, you may check the output in the following way...
[kha@lhcb-phi offload-bandwidth]$ cat *.out
Using MIC0...
Transferred: 90 GB
Total time: 13.1119 secs
Bandwidth: 6.864 GBps
Using MIC1...
Transferred: 90 GB
Total time: 13.5207 secs
Bandwidth: 6.65647 GBps
Using MIC2...
Transferred: 90 GB
Total time: 13.1548 secs
Bandwidth: 6.84162 GBps
Using MIC3...
Transferred: 90 GB
Total time: 25.9486 secs
Bandwidth: 3.4684 GBps
C.3 PREFIX-OFFSET
Change to the directory xphi−lhcb/src/prefix−offset/ and run the script:
[kha@lhcb-phi prefix-offset]$ ./upload-to-MIC.sh
icpc -lrt -I../../include -openmp -std=c++14 -mmic main.cpp ../utils.cpp ../prefix-sum.cpp -o mic-prefix-offset.exe
mic-prefix-offset.exe 100% 64KB 64.4KB/s 00:00
libiomp5.so 100% 1268KB 1.2MB/s 00:00
Generated random lengths:
Too many numbers to display!
Offsets:
Too many numbers to display!
Total elements: 200000000
Total time: 2.57888 secs
Processed: 7.75531e+07 elements per second
Processed: 0 GBps
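The output above lists random lengths and the offsets derived from them. A minimal sequential sketch of such a computation is shown here, assuming an exclusive prefix sum (each offset is the sum of all preceding lengths). The function names `offsets_from_lengths` and `elements_per_second` are hypothetical, and the actual program in the repository is compiled with OpenMP and may compute the offsets differently:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum (assumed semantics): offsets[i] is the sum of
// lengths[0..i-1], i.e. the position where element i starts when all
// elements are stored back to back.
std::vector<std::size_t> offsets_from_lengths(
        const std::vector<std::size_t>& lengths) {
    std::vector<std::size_t> offsets(lengths.size());
    std::size_t running_total = 0;
    for (std::size_t i = 0; i < lengths.size(); ++i) {
        offsets[i] = running_total;   // start of element i
        running_total += lengths[i];  // advance past element i
    }
    return offsets;
}

// Rate corresponding to the "elements per second" line of the output.
double elements_per_second(double total_elements, double seconds) {
    assert(seconds > 0.0);
    return total_elements / seconds;
}
```

With the figures printed above, 200000000 elements in 2.57888 secs correspond to roughly 7.755e+07 elements per second.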
C.4 EVENT-SORT
Change to the directory xphi-lhcb/src/event-sort/ and run the script:
[kha@lhcb-phi event-sort]$ ./upload-to-MIC.sh
Using MIC0...
icpc -g -lrt -I../../include -openmp -std=c++14 -qopt-report3 -qopt-report-phase=vec -mmic main.cpp ../prefix-sum.cpp ../utils.cpp -o event-sort.mic.exe
icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
event-sort.mic.exe 100% 143KB 142.6KB/s 00:00
benchmarks.sh 100% 898 0.9KB/s 00:00
libiomp5.so 100% 1268KB 1.2MB/s 00:00
--------STATISTICS OF TIME INTERVALS--------
The initial iteration: 0.47684 secs
min: 0.15831 secs
max: 0.15947 secs
mean: 0.15889 secs
Histogram:
[0.15831, 0.15860): 2 times
[0.15860, 0.15889): 4 times
[0.15889, 0.15918): 2 times
[0.15918, 0.15947): 2 times
--------------------------------------------
----------SUMMARY----------
Total elements: 1e+08
Time for computing read_offsets: 0.159042 secs
Time for computing write_offsets: 0.288448 secs
Time for copying: 1.14138 secs
Total time: 1.58887 secs
Total size: 11.5004 GB
Processed: 6.29379e+07 elements per second
Throughput: 7.23812 GBps
---------------------------
This script cross-compiles the source code for the Intel Xeon Phi architecture and uploads
the binary and the required libraries using scp. On the MIC, the binary is then run with the
default parameter settings.
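The SUMMARY block reports three phases: computing read_offsets, computing write_offsets and copying. The printed total time equals the sum of the three phase times, and the throughput equals the total size divided by the total time. The sketch below illustrates how such a three-phase event-sort could look. It is a sequential illustration under stated assumptions, not the repository's implementation: the `Table` layout `lengths[source][event]`, the function names and the event-major output order are all hypothetical choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// lengths[s][e] = size in bytes of the fragment that source s holds for
// event e. (Hypothetical layout chosen for this sketch.)
using Table = std::vector<std::vector<std::size_t>>;

// Read offsets: start of each fragment inside its own source buffer,
// i.e. an exclusive prefix sum over the events of one source.
Table compute_read_offsets(const Table& lengths) {
    Table offsets(lengths.size());
    for (std::size_t s = 0; s < lengths.size(); ++s) {
        std::size_t position = 0;
        for (std::size_t length : lengths[s]) {
            offsets[s].push_back(position);
            position += length;
        }
    }
    return offsets;
}

// Write offsets: start of each fragment in the output buffer, where all
// fragments of event 0 come first (source by source), then event 1, etc.
Table compute_write_offsets(const Table& lengths, std::size_t events) {
    Table offsets(lengths.size(), std::vector<std::size_t>(events, 0));
    std::size_t position = 0;
    for (std::size_t e = 0; e < events; ++e) {
        for (std::size_t s = 0; s < lengths.size(); ++s) {
            assert(e < lengths[s].size());
            offsets[s][e] = position;
            position += lengths[s][e];
        }
    }
    return offsets;
}

// Copy every fragment from its source buffer to its event-ordered slot.
std::vector<char> sort_by_event(const std::vector<std::vector<char>>& sources,
                                const Table& lengths) {
    assert(sources.size() == lengths.size());
    const std::size_t events = lengths.empty() ? 0 : lengths[0].size();
    std::size_t total = 0;
    for (std::size_t s = 0; s < lengths.size(); ++s) {
        assert(lengths[s].size() == events);  // same event count per source
        std::size_t source_total = 0;
        for (std::size_t length : lengths[s]) source_total += length;
        assert(sources[s].size() >= source_total);
        total += source_total;
    }
    const Table read = compute_read_offsets(lengths);
    const Table write = compute_write_offsets(lengths, events);
    std::vector<char> output(total);
    for (std::size_t s = 0; s < lengths.size(); ++s) {
        for (std::size_t e = 0; e < events; ++e) {
            if (lengths[s][e] > 0) {  // skip empty fragments
                std::memcpy(output.data() + write[s][e],
                            sources[s].data() + read[s][e],
                            lengths[s][e]);
            }
        }
    }
    return output;
}
```

Because each fragment has its own precomputed read and write offset, the copies in the final loop are independent of one another, which is what makes this kind of task a natural fit for many threads.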
You can also run several benchmark tests that vary the number of sources and the MEP
factor, as well as the number of iterations:
[kha@lhcb-phi event-sort]$ ./upload-to-MIC.sh -b
Running benchmarks.sh
Using MIC0...
icpc -g -lrt -I../../include -openmp -std=c++14 -qopt-report3 -qopt-report-phase=vec -mmic main.cpp ../prefix-sum.cpp ../utils.cpp -o event-sort.mic.exe
icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
event-sort.mic.exe 100% 143KB 142.6KB/s 00:00
benchmarks.sh 100% 898 0.9KB/s 00:00
libiomp5.so 100% 1268KB 1.2MB/s 00:00
Varying the number of sources and the MEP factor...
./event-sort.mic.exe -s 1 -m 10000000
...
Varying the number of iterations...
...

Real-time applications on IntelXeon/Phi

  • 1. March 26, 2016 .. Real-time applications on Intel Xeon/Phi Karel Ha CERN High Throughput Computing collaboration Summary: The Intel Xeon/Phi platform is a powerful x86 multi-core engine with a very high-speed memory interface. In its next version it will be able to operate as a stand-alone system with a very high-speed interconnect. This makes it a very interesting candidate for (near) real-time applications such as event-building, event-sorting and event preparation for subsequent processing by high level trigger software algorithms. Real-time applications on Intel Xeon/Phi 1
  • 2. March 26, 2016 Abstract The following document is a report providing the first results on the performance of In- tel Xeon Phi computing accelerator in the context of LHCb Online Data Acquisition system (DAQ). Themainfocusisputintotheevent-sortingtask: whendataarrivefromdifferentsources corresponding to different parts of the LHCb detector, they are grouped by the source, from which they originate. In the next stage of DAQ, it is necessary to make a decision, whether to store the given collision event or not. For this purpose, it is more convenient to group the data by their memberships to collision events (i.e. all data from one collision need to be placed together), so that the DAQ system can decide based on the “whole picture” of one event. The Xeon Phi is an interesting candidate for event-sorting task. It offers a large number of cores and vast amount of memory. Furthermore, this task can also be very well paral- lelized, which can make it especially suitable for the many-core architecture of the Xeon Phi. Thus, this report may be used to study feasibility of the Intel Xeon Phi platform for the next upgrade of the LHCb detector in 2018-2019. Real-time applications on Intel Xeon/Phi 2
  • 3. Contents
    1 Introduction
      1.1 Description of the problem
      1.2 Algorithm
      1.3 The goal
    2 Offload-bandwidth
    3 Prefix-offset
    4 Event-sort
      4.1 The distribution of iteration durations
      4.2 Comparison between event-sort and raw memcpy
      4.3 Blockschemes for memcpy
      4.4 ASLR on KNC and its effect on event-sort
      4.5 Fixation of input data
      4.6 Varying of number of copy-threads
    5 Some ideas for future work
    6 Conclusion
    Appendix A Infrastructure
    Appendix B Compilers
    Appendix C Reproducing the event-sort results
      C.1 Source code and setup
      C.2 Offload-bandwidth
      C.3 Prefix-offset
      C.4 Event-sort
  • 4. 1 INTRODUCTION
    Intel Xeon Phi, or Intel Many Integrated Core Architecture (MIC), is a promising x86 many-core computing accelerator. As such, it is suitable for highly parallelizable jobs such as event-sorting, a subtask of the LHCb Data Acquisition System (DAQ). In this report, we present our measurements of event-sorting on the Intel Xeon Phi card, specifically the “Knights Corner” (KNC) version.
    There are 3 demo programs:
      • offload-bandwidth
      • prefix-offset
      • event-sort
    The first two serve as preliminary tools for baseline benchmarks and for testing the properties of the Xeon Phi, whereas the last one simulates the real conditions of event-sorting in the LHCb DAQ.
    For details on the hardware and software used, consult Appendices A and B. Appendix C contains the instructions for reproducing the results.
    There is also a shared CERNBox folder htcc_shared, which contains all the logs that I regularly kept during my internship. For full details (source code, bash and gnuplot scripts, figures, raw output files, results etc.), request access to the shared folder and consult my logs.
    1.1 DESCRIPTION OF THE PROBLEM
    The LHCb detector at CERN is a complex instrument consisting of many subdetectors. Hence, there are also many (approximately 1000) sources of input channels for the DAQ system. Each of the readout boards keeps its fragments of information (so-called MEP fragments, or mep_contents in the source code) in its own buffer. The fragments come from different channels and different collisions. The number of collisions is called the MEP factor (by default 10000 fragments per source).
    For further processing, however, it is much more favorable to re-arrange (transpose) the fragments and group them together according to the collision they belong to:
  • 5. FIGURE 1: TRANSPOSE OF FRAGMENTS
    For better illustration, see the example below:
        ----------Input MEP contents----------
        Source #0       111222333334444
        Source #1       555566667777788888
        Source #2       9999aaaaabbbcc
        --------------------------------------
        ----------Output MEP contents---------
        Collision #0    11155559999
        Collision #1    2226666aaaaa
        Collision #2    3333377777bbb
        Collision #3    444488888cc
        --------------------------------------
    In the “Input MEP contents”, source #0 stores 3 bytes from collision #0 (labeled by character “1”), 3 bytes from collision #1 (labeled by character “2”), 5 bytes from collision #2 (labeled by character “3”) and 4 bytes from collision #3 (labeled by character “4”).
    Source #1 (corresponding to a different subdetector) stores 4 bytes from collision #0 (labeled by character “5”) followed by the data from collisions #1 to #3.
    Source #2 also stores 4 bytes from collision #0 (labeled by character “9”), and likewise for the remaining collisions.
    At this point, the transposition re-shuffles the data so that all the information from one collision is placed together. Therefore, in the “Output MEP contents”, the buffer for collision #0 contains the previously mentioned 3 bytes from source #0 (labeled by character “1”), 4 bytes from source #1 (labeled by character “5”) and 4 bytes from source #2 (labeled by character “9”).
    Here is another example of the transposition:
        ----------Input MEP contents----------
        Source #0       11111222333334444
  • 6.
        Source #1       5566667777788888
        Source #2       99aaaaabbbcc
        --------------------------------------
        ----------Output MEP contents---------
        Collision #0    111115599
        Collision #1    2226666aaaaa
        Collision #2    3333377777bbb
        Collision #3    444488888cc
        --------------------------------------
    The lengths of MEP fragments (usually 80-120 bytes per fragment) are represented as 16-bit integers and stored in a separate array. The reason is performance: several length values fit into one cache line, so we can read and process several fragment lengths with one cache load.
    The buffers for MEP fragments are stored in an array of arrays. There is one array mep_contents[i] for each source #i. A contiguous block of memory is allocated for every such buffer mep_contents[i]. However, two consecutive buffers do not necessarily lie in one contiguous block of memory.
    The output array is saved in one contiguous block of memory. It stores the “re-shuffled” copies of fragments, now grouped by collisions into collision blocks. Furthermore, the collision blocks are concatenated according to the collision index. For instance, the second example above would produce this output array:
        111115599 2226666aaaaa 3333377777bbb 444488888cc
    The spaces were added for clarity, in order to separate different collisions.
    1.2 ALGORITHM
    In order to copy the data for the transposition, two types of array offsets (represented as 32-bit integers) need to be computed for each fragment of each source:
      • read_offsets[] is the array of offsets determining where to copy from. Each offset is the number of bytes from the beginning of mep_contents[i], where i is the source of the fragment.
      • write_offsets[] is the array of offsets determining where to copy to. Each offset is the number of bytes from the beginning of the output array.
    Offsets are computed by applying a prefix sum to the appropriate elements of the array of lengths.
    The prefix sum is the following problem: given an array of numbers a[], produce an array s[] of the same size, where s[0] = 0 and s[i] = a[0] + a[1] + ... + a[i - 1] for i > 0. The prefix-sum problem is the core part of event sorting.
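The definition of the prefix sum above can be sketched in a few lines of C++. This is a minimal serial illustration, not the implementation from the repository: the function name exclusive_prefix_sum is hypothetical, and only the 16-bit lengths and 32-bit offsets are taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not taken from the xphi-lhcb sources.
// Exclusive prefix sum: s[0] = 0 and s[i] = a[0] + ... + a[i-1] for i > 0.
// Lengths are 16-bit and offsets are 32-bit, as described in the report.
std::vector<uint32_t> exclusive_prefix_sum(const std::vector<uint16_t>& a) {
  std::vector<uint32_t> s(a.size());
  uint64_t running = 0;  // wider accumulator so that overflow can be detected
  for (std::size_t i = 0; i < a.size(); ++i) {
    assert(running <= UINT32_MAX);  // every offset must fit into 32 bits
    s[i] = static_cast<uint32_t>(running);  // sum of all elements before i
    running += a[i];
  }
  return s;
}
```

For the lengths 3, 3, 5, 4 of source #0 in the first example, this yields the offsets 0, 3, 6, 11, i.e. the positions at which the four fragments start inside the buffer of that source.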
  • 7. Since the prefix sum for read_offsets[] within one source buffer is independent of the computations in other source buffers, we may parallelize it using #pragma omp parallel for.
    Similarly, the prefix sum for write_offsets[] can also be parallelized using #pragma omp parallel for (for details, see the function get_write_offsets_OMP_version() in prefix-sum.cpp).
    After read_offsets and write_offsets are computed, the content of each fragment can be copied using the standard memcpy() function. The copies of individual MEP fragments are independent of one another and hence can be run in parallel. Namely, #pragma omp parallel for has been used to parallelize the loop that iterates over all MEP fragments and performs the memcopies.
    1.3 THE GOAL
    The goal of the demos is to test the speed and the feasibility of the Xeon Phi for event-sorting. Possible performance improvements are studied, namely various parallelization techniques.
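The three steps described above (read offsets, write offsets, parallel memcpy) can be put together in a small sketch. It is only an illustration under stated assumptions: the function transpose_fragments and its 2D layout of lengths are hypothetical and differ from the repository code, std::string stands in for the raw byte buffers, and the write offsets are computed serially here for brevity although the report parallelizes that step too.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper, not taken from the xphi-lhcb sources.
// mep_contents[s] holds the concatenated fragments of source s;
// lengths[s][c] is the size in bytes of the fragment of source s for collision c.
std::string transpose_fragments(
    const std::vector<std::string>& mep_contents,
    const std::vector<std::vector<uint16_t>>& lengths) {
  assert(lengths.size() == mep_contents.size());
  const std::ptrdiff_t n_sources =
      static_cast<std::ptrdiff_t>(mep_contents.size());
  const std::ptrdiff_t n_collisions =
      n_sources > 0 ? static_cast<std::ptrdiff_t>(lengths[0].size()) : 0;

  using Offsets = std::vector<std::vector<uint32_t>>;
  Offsets read_offsets(n_sources, std::vector<uint32_t>(n_collisions));
  Offsets write_offsets(n_sources, std::vector<uint32_t>(n_collisions));

  // read_offsets: one independent prefix sum per source buffer.
  #pragma omp parallel for
  for (std::ptrdiff_t s = 0; s < n_sources; ++s) {
    assert(static_cast<std::ptrdiff_t>(lengths[s].size()) == n_collisions);
    uint32_t offset = 0;
    for (std::ptrdiff_t c = 0; c < n_collisions; ++c) {
      read_offsets[s][c] = offset;
      offset += lengths[s][c];
    }
    assert(offset == mep_contents[s].size());
  }

  // write_offsets: prefix sum in output order (collision by collision).
  // Serial in this sketch.
  uint32_t total = 0;
  for (std::ptrdiff_t c = 0; c < n_collisions; ++c) {
    for (std::ptrdiff_t s = 0; s < n_sources; ++s) {
      write_offsets[s][c] = total;
      total += lengths[s][c];
    }
  }

  // Copy step: every fragment is copied independently of the others.
  std::string output(total, '\0');
  const std::ptrdiff_t n_fragments = n_sources * n_collisions;
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < n_fragments; ++i) {
    const std::ptrdiff_t s = i / n_collisions;
    const std::ptrdiff_t c = i % n_collisions;
    std::memcpy(&output[write_offsets[s][c]],
                mep_contents[s].data() + read_offsets[s][c],
                lengths[s][c]);
  }
  return output;
}
```

Called with the three source buffers of the first example and the lengths {3,3,5,4}, {4,4,5,5}, {4,5,3,2}, the function returns the four collision blocks 11155559999, 2226666aaaaa, 3333377777bbb and 444488888cc concatenated into one string. Without OpenMP enabled at compile time the pragmas are ignored and the code runs serially with the same result.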
  • 8. 2 OFFLOAD-BANDWIDTH
    This program measures the bandwidth between the host and the device using the #pragma offload directive...
    a) offloading only to the device:
        $ make && ./offload-bandwidth.exe -i 20 -e 1500000000
        icpc -lrt main.cpp -o offload-bandwidth.exe
        Using MIC0...
        Transferred: 30 GB
        Total time: 4.37726 secs
        Bandwidth: 6.8536 GBps
    b) offloading to the device and copying the result back:
        $ make && ./offload-bandwidth.exe -i 20 -e 1500000000
        icpc -lrt main.cpp -o offload-bandwidth.exe
        Using MIC0...
        Transferred: 60 GB
        Total time: 8.67822 secs
        Bandwidth: 6.91386 GBps
    This bandwidth corresponds to the speed of the 50 Gbit/s PCIe interface between the host and the device. Here, the host machine is lhcb-phi.cern.ch (see Appendix A).
    The speed remains the same even when offload-bandwidth is launched on all 4 Xeon Phi cards at the same time (as 4 concurrent processes). This means there are four 50 Gbit/s PCIe interfaces and each of them can be fully saturated during offloads.
    For more details, consult the README at https://github.com/mathemage/xphi-lhcb/tree/master/src/offload-bandwidth#parallel-run-on-all-available-mics
  • 9. 3 PREFIX-OFFSET
    This program implements the prefix sum calculation and tests its speed.
    a) 1000 iterations for the array size of 40000000, short int numbers a[i] range from 0 to 100:
        Total time: 521.639 secs
        Processed: 7.66814e+07 elements per second
    b) 100000 iterations for the array size of 40000000, short int numbers a[i] range from 0 to 65534:
        Total elements: 6000000000
        Total time: 77.8086 secs
        Processed: 7.71123e+07 elements per second
    This is the result from 1 KNC card with lhcb-phi.cern.ch as the host (see Appendix A).
    For more details, see the README at https://github.com/mathemage/xphi-lhcb/tree/master/src/prefix-offset#output
  • 10. 4 EVENT-SORT
    LHCb Online owns 4 Intel Xeon Phi “KNC” cards. They are available on the lhcb-phi.cern.ch machine (see Appendix A).
    4.1 THE DISTRIBUTION OF ITERATION DURATIONS
    The simulation is iterated many times to average out statistical fluctuations. The number of iterations is controlled via the command-line argument -i.
    a) The results for 200 iterations:
        # ./event-sort.mic.exe -i 200
        ...
        ----------SUMMARY----------
        Total elements: 2e+09
        Time for computing read_offsets: 0.553636 secs
        Time for computing write_offsets: 2.50423 secs
        Time for copying: 17.4631 secs
        Total time: 20.521 secs
        Total size: 230.013 GB
        Processed: 9.74612e+07 elements per second
        Throughput: 11.2087 GBps
        ---------------------------
    “Time for computing read_offsets” is the total time spent calculating prefix sums for read_offsets[], “Time for computing write_offsets” is the total time spent calculating prefix sums for write_offsets[], and “Time for copying” is the total time spent performing memcpy() of MEP fragments.
    b) The results and the histogram for 1000 iterations:
        --------STATISTICS OF TIME INTERVALS (in secs)------------
        The initial iteration: 0.43506
        min: 0.10139
        max: 0.10303
        mean: 0.10216
        ...
        --------------------------------------------
        --------STATISTICS OF THROUGHPUTS (in GBps)---------------
        min: 11.16119
        max: 11.34263
        mean: 11.25702
  • 11.
        ...
        --------------------------------------------
        ----------SUMMARY----------
        Total elements: 1e+10
        Time for computing read_offsets: 3.14013 secs
        Time for computing write_offsets: 12.2161 secs
        Time for copying: 86.8014 secs
        Total time: 102.158 secs
        Total size: 1149.98 GB
        Processed: 9.7888e+07 elements per second
        Throughput: 11.2569 GBps
        ---------------------------
    The histograms of the previous measurements:
  • 12. 4.2 COMPARISON BETWEEN EVENT-SORT AND RAW MEMCPY
    The program memcpy-bandwidth tests only the throughput of the memcpy() function on the Intel Xeon Phi. It copies chunks (arrays) of data from one place to another (with OpenMP parallelization). This process is iterated (50 times in the case below) and the final throughput is calculated.
    The number of threads is varied using #pragma omp parallel for num_threads(). The corresponding plot is in Figure 2.
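A raw-memcpy benchmark of the kind described above might look like the following sketch. It is not the memcpy-bandwidth program itself: the function name, its parameters and the convention 1 GB = 10^9 bytes are assumptions made for illustration. Only the use of memcpy(), the repeated iterations and #pragma omp parallel for num_threads() come from the text.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch, not the memcpy-bandwidth program from the repository.
// Copies src to dst chunk by chunk, `iterations` times, with `nthreads`
// OpenMP threads, and returns the throughput in GB/s (assuming 1 GB = 1e9 B).
double memcpy_throughput_gbps(const std::vector<char>& src,
                              std::vector<char>& dst,
                              std::size_t chunk_size,
                              int iterations,
                              int nthreads) {
  assert(chunk_size > 0 && iterations > 0 && nthreads > 0);
  assert(src.size() == dst.size() && src.size() % chunk_size == 0);
  const std::ptrdiff_t n_chunks =
      static_cast<std::ptrdiff_t>(src.size() / chunk_size);

  const auto start = std::chrono::steady_clock::now();
  for (int it = 0; it < iterations; ++it) {
    // Each chunk is copied independently, so the loop can be shared
    // among the requested number of threads.
    #pragma omp parallel for num_threads(nthreads)
    for (std::ptrdiff_t i = 0; i < n_chunks; ++i) {
      std::memcpy(dst.data() + i * chunk_size,
                  src.data() + i * chunk_size,
                  chunk_size);
    }
  }
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;

  const double bytes = static_cast<double>(src.size()) * iterations;
  return bytes / 1e9 / elapsed.count();
}
```

Calling the function in a loop over nthreads gives one data point per thread count, which is the shape of measurement plotted in Figure 2. For very small buffers the elapsed time can be too short to measure, so the buffers should be large compared to the caches.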
  • 13. FIGURE 2: EVENT-SORT COMPARED TO RAW MEMCPY(), WITH VARIABLE NUMBER OF THREADS
    4.3 BLOCKSCHEMES FOR MEMCPY
    The memory access patterns for event-sort can be optimized by splitting the workload into blocks of fragments (blockschemes). The serial version of event-sort would process fragments as shown in Figure 3. Each circle represents one MEP fragment, indexed by its source and its event.
    FIGURE 3: WITHOUT A BLOCKSCHEME
  • 14. The previously mentioned parallelized event-sort would assign each circle to a single worker thread. Since the sizes of fragments are typically 80-120 B, each memcpy is inefficient: the core caches are much larger than one fragment and thus are not fully used.
    By assigning a whole block of fragments to each worker thread, we reduce cache thrashing. There are 4 blocks of size 2x2 in the blockscheme of Figure 4, which would be processed by 4 worker threads in parallel.
    FIGURE 4: 2X2 BLOCKS
    Moreover, the spatial locality of data can also play an important role: fragments in the rows of the picture are stored in a contiguous block of memory. Thus, the blocks load from and store into only contiguous parts of memory.
    The algorithm is given the block dimensions (in the picture: 2 sources per block, 2 events per block). The blocks are then distributed among worker threads (by an OpenMP parallel for loop). Within every block, the assigned worker performs a memcpy for each fragment using the previously computed read_offsets[] and write_offsets[].
    In order to find the optimal block dimensions, a series of benchmark tests has been carried out. The results are represented in the following heatmap:
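The block-wise copy described above can be sketched as follows. This is an illustration under assumptions, not the copy_MEPs_block_scheme() function from the repository: the signature, the flat source-major indexing [s * n_events + e] of the arrays and the use of std::string for the byte buffers are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical sketch of a blockscheme copy (names and layout are assumed).
// lengths, read_offsets and write_offsets are flat arrays indexed by
// [s * n_events + e] for source s and event e; the offsets are precomputed.
void copy_block_scheme(const std::vector<std::string>& mep_contents,
                       const std::vector<uint16_t>& lengths,
                       const std::vector<uint32_t>& read_offsets,
                       const std::vector<uint32_t>& write_offsets,
                       std::size_t n_events,
                       std::size_t sources_per_block,
                       std::size_t events_per_block,
                       std::string& output) {
  const std::size_t n_sources = mep_contents.size();
  assert(sources_per_block > 0 && events_per_block > 0);
  assert(lengths.size() == n_sources * n_events);
  assert(read_offsets.size() == lengths.size());
  assert(write_offsets.size() == lengths.size());

  // Number of blocks along each dimension (the last block may be smaller).
  const std::ptrdiff_t blocks_s = static_cast<std::ptrdiff_t>(
      (n_sources + sources_per_block - 1) / sources_per_block);
  const std::ptrdiff_t blocks_e = static_cast<std::ptrdiff_t>(
      (n_events + events_per_block - 1) / events_per_block);
  const std::ptrdiff_t n_blocks = blocks_s * blocks_e;

  // One loop iteration handles one whole block of fragments.
  #pragma omp parallel for
  for (std::ptrdiff_t b = 0; b < n_blocks; ++b) {
    const std::size_t s_begin = (b / blocks_e) * sources_per_block;
    const std::size_t e_begin = (b % blocks_e) * events_per_block;
    const std::size_t s_end = std::min(s_begin + sources_per_block, n_sources);
    const std::size_t e_end = std::min(e_begin + events_per_block, n_events);
    for (std::size_t s = s_begin; s < s_end; ++s) {
      for (std::size_t e = e_begin; e < e_end; ++e) {
        const std::size_t idx = s * n_events + e;
        std::memcpy(&output[write_offsets[idx]],
                    mep_contents[s].data() + read_offsets[idx],
                    lengths[idx]);
      }
    }
  }
}
```

With sources_per_block = 2 and events_per_block = 2 this corresponds to the 2x2 blocks of Figure 4. The result is identical to the fragment-by-fragment copy; only the assignment of fragments to threads changes.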
  • 15. FIGURE 5: EVENT-SORT WITH VARIOUS PARAMETERS OF BLOCKSCHEME (KNC)
    The event-sort with optimal block dimensions (according to the heatmap on the right side):
        # ./upload-to-MIC.sh -i 100 -1 5 -2 28
        ...
        ____________________________SUMMARY____________________________
        Total elements: 1e+09
        Time for computing read_offsets: 0.28435 secs
        Time for computing write_offsets: 1.13954 secs
        Time for copying: 3.1574 secs
        Total time: 4.58129 secs
        Total size: 114.998 GB
        Processed: 2.18279e+08 elements per second
        Throughput: 25.1016 GBps
        _______________________________________________________________
    Comparing the times, about 69 % of all the time is spent doing memcopies. The rest is the computation of offsets.
    Moreover, the overall throughput has been improved by a factor of > 2 (compare with Section 4.1).
  • 16. 4.4 ASLR ON KNC AND ITS EFFECT ON EVENT-SORT
    Address Space Layout Randomization (ASLR) was suspected to cause great inconsistency in results on KNL Xeon Phi. This was pointed out by Wim Heirman. This is the e-mail conversation with him:
        Hi Karel,
        I did some more runs, now with Linux address randomization turned on (my machine had it disabled previously). I do see some large variations now. Do you have address randomization turned on for your machine? (see output of "sysctl kernel.randomize_va_space", 0 means disabled while 1 and 2 enable different parts of it). Can you do a few more runs with a disabled setting? (See [1], I think the setarch -R option should work even if you don't have root access).
        Regards,
        Wim
        [1] http://stackoverflow.com/questions/11238457/disable-and-re-enable-address-space-layout-randomization-only-for-myself
    I have tried my application on KNCs with various settings of ASLR. There were 100 experiments (runs), each of which performed only 1 iteration.
    For kernel.randomize_va_space = 0:
        mean = 20.0434
        min = 19.6947
        max = 20.4567
        standard deviation = 0.1267
    For kernel.randomize_va_space = 1:
        mean = 20.3565
        min = 19.5846
        max = 21.1473
        standard deviation = 0.3669
    For kernel.randomize_va_space = 2:
        mean = 20.305
        min = 19.555
        max = 21.1037
        standard deviation = 0.3641
    In conclusion, ASLR does seem to have some effect on the variation: the standard deviation is roughly three times larger with ASLR enabled (settings 1 and 2) than with ASLR disabled.
    4.5 FIXATION OF INPUT DATA
    Rainer and I had a hypothesis that the throughput of event-sort may be highly dependent on the input data size (if lengths fit cache lines). In order to test this idea, I have implemented an option --srand-seed. It sets a custom seed for the srand() function, which is used for randomizing the input data. Hence, by initializing to a (chosen) custom seed, the input will always be the same between different runs.
  • 17. For the range of seeds from 0 to 100, I have studied the variability (mean, standard deviation, min, max) of the resulting throughputs. The screenshot of the results can be found in Figure 6.
    The mean, the (sample-based) standard deviation, the min and the max are always taken from 10 runs. Each run initializes srand() to the same seed (the one given in the first column). Blue and red cells mark the min and max, respectively, of the values in the corresponding column.
    For comparison, here is an entirely serial version (i.e. copy_MEPs_serial_version()) with the two chosen seeds:
      • srand-seed == 83:
          mean = 0.111149, standard deviation = 4.47532e-05, min = 0.111081, max = 0.111204
          mean = 0.111167, standard deviation = 8.98804e-05, min = 0.111082, max = 0.111397
          mean = 0.11108, standard deviation = 0.000120816, min = 0.110984, max = 0.111401
      • srand-seed == 89:
          mean = 0.111119, standard deviation = 5.10757e-05, min = 0.11104, max = 0.111186
          mean = 0.111151, standard deviation = 5.33504e-05, min = 0.111079, max = 0.111227
          mean = 0.111093, standard deviation = 0.000144087, min = 0.110992, max = 0.111487
    There was no OpenMP for the copying part, but there are still two OpenMP parallel functions for the computation part. That is why the deviation is not exactly 0.
    The conclusion: even though the deviation is negligible, it is still clearly nonzero. This suggests that the variation has another cause, possibly the non-determinism of thread scheduling.
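The per-seed statistics quoted above (mean, sample-based standard deviation, min, max over a set of runs) can be computed as in the following sketch. The struct RunStats and the function summarize_runs are hypothetical names, not taken from the repository. The sample-based standard deviation is assumed to use the usual n - 1 denominator.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper for summarizing the results of repeated runs.
struct RunStats {
  double mean;
  double stddev;  // sample-based: divides by (n - 1)
  double min;
  double max;
};

RunStats summarize_runs(const std::vector<double>& values) {
  assert(values.size() >= 2);  // a sample deviation needs at least 2 runs
  const double n = static_cast<double>(values.size());

  double sum = 0.0;
  for (double v : values) sum += v;
  const double mean = sum / n;

  double squared_deviations = 0.0;
  for (double v : values) squared_deviations += (v - mean) * (v - mean);

  RunStats stats;
  stats.mean = mean;
  stats.stddev = std::sqrt(squared_deviations / (n - 1.0));
  stats.min = *std::min_element(values.begin(), values.end());
  stats.max = *std::max_element(values.begin(), values.end());
  return stats;
}
```

Feeding it the 10 throughputs measured for one seed would give one row of the table in Figure 6.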
  • 18. FIGURE 6: EVENT-SORT (IN GBYTES/S) ON KNC FOR VARIOUS ASLR AND VARIOUS FIXATED INPUT DATA (DEPENDENT ON THE SEED)
  • 19. 4.6 VARYING OF NUMBER OF COPY-THREADS
    Another idea is to fixate the input data and vary the number of threads that perform the copying part. This is done by OpenMP here:
        void copy_MEPs_block_scheme() {
          ...
        #pragma omp parallel for num_threads(nthreads)
          ...
        }
    Figure 7 shows the dependency of the (sample-based) standard deviation on the number of copying threads. The deviation is taken from 10 experiments (runs). The tested numbers of copy-threads are 1, 2, 4, 8, 16, 32 and 64.
    Figure 8 shows the identical experiment for all numbers of copy-threads from 1 to 64.
    From the latter figure, there seems to be no apparent dependency between the number of copy-threads and the standard deviation of the runs.
  • 20. FIGURE 7: NUMBER OF (COPYING) THREADS VS. STANDARD DEVIATION OF RESULTS (1, 2, 4, 8, 16, 32, 64 THREADS)
  • 21. FIGURE 8: NUMBER OF (COPYING) THREADS VS. STANDARD DEVIATION OF RESULTS (1, 2, 3, 4, ..., 64 THREADS)
  • 22. 5 SOME IDEAS FOR FUTURE WORK
      • “Recompile” the event-sort project using the ispc compiler: https://ispc.github.io/. This compiler has promising auto-vectorization capabilities.
      • Write unit tests for the project. For instance, using the Google Test framework: https://github.com/google/googletest
      • Use CMake instead of hand-written Makefiles: https://cmake.org/
      • Consider (try, test and benchmark) usage of Intel TBB for the prefix-sum functions: https://www.threadingbuildingblocks.org/
      • Consider (try, test and benchmark) usage of OpenCL for the prefix-sum functions: https://www.khronos.org/opencl/
      • Run high_performance_linpack_benchmark on Xeon Phi: https://lbdokuwiki.cern.ch/doku.php?id=upgrade:high_performance_linpack_benchmark
      • Participate in the CERN Concurrency Forum: http://concurrency.web.cern.ch/
  • 23. 6 CONCLUSION
    The simulations of the event-sorting task show that KNC is capable of delivering a throughput of about 25 GB/s. Our aim was to reach about 12.5 GB/s, so as to saturate the 100 Gbit/s Ethernet network, which is one of the candidate networks for the LHCb upgrade.
    This has been accomplished by splitting the workload into blocks of fragments and letting the threads memcopy whole blocks of fragments rather than doing it fragment by fragment.
    The excess throughput can be exploited as additional computing power! For example, some portion of the Xeon Phi cards (cores, number of threads) can be allocated for event-sorting (just enough for 12.5 GB/s), whereas the remaining capacity may be used for other algorithms, so as to start the reconstruction process already at this very early stage. Thus, the overall quality of the decisions whether to store or discard the events would improve.
  • 24. A INFRASTRUCTURE
    The LHCb Online group provides the server machine lhcb-phi.cern.ch.
    This host machine contains 32 logical Intel(R) Xeon(R) 2.00GHz processors:
        [kha@lhcb-phi kha]$ less /proc/cpuinfo | tail -n 26
        processor       : 31
        vendor_id       : GenuineIntel
        cpu family      : 6
        model           : 45
        model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
        stepping        : 7
        microcode       : 1808
        cpu MHz         : 1200.000
        cache size      : 20480 KB
        physical id     : 1
        siblings        : 16
        core id         : 7
        cpu cores       : 8
        apicid          : 47
        initial apicid  : 47
        fpu             : yes
        fpu_exception   : yes
        cpuid level     : 13
        wp              : yes
        flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
        bogomips        : 4014.16
        clflush size    : 64
        cache_alignment : 64
        address sizes   : 46 bits physical, 48 bits virtual
        power management:
    with the operating system:
        [kha@lhcb-phi kha]$ uname -a
        Linux lhcb-phi 2.6.32-504.el6.x86_64 #1 SMP Tue Oct 14 11:22:00 CDT 2014 x86_64 x86_64 x86_64 GNU/Linux
    On top of that, there are also 4 Intel KNC Xeon Phi cards (so-called “devices”, here MIC0, MIC1, MIC2 and MIC3).
    They are connected via PCIe 50 Gbit/s lanes to the host and each of them has 228 logical processors:
        [xeonphi@lhcb-phi-mic0 ~]$ less /proc/cpuinfo | tail -n 26
        processor       : 31
        vendor_id       : GenuineIntel
        cpu family      : 11
        model           : 1
        model name      : 0b/01
        stepping        : 3
        cpu MHz         : 1100.000
        cache size      : 512 KB
        physical id     : 0
        siblings        : 228
        core id         : 56
        cpu cores       : 57
        apicid          : 227
        initial apicid  : 227
        fpu             : yes
        fpu_exception   : yes
        cpuid level     : 4
        wp              : yes
        flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm
        bogomips        : 2205.22
        clflush size    : 64
        cache_alignment : 64
        address sizes   : 40 bits physical, 48 bits virtual
        power management:
    each with the operating system:
        [kha@lhcb-phi kha]$ uname -a
        Linux lhcb-phi 2.6.32-504.el6.x86_64 #1 SMP Tue Oct 14 11:22:00 CDT 2014 x86_64 x86_64 x86_64 GNU/Linux
  • 25. B COMPILERS
    The source code is written in C++ and uses OpenMP for task-based parallelization. It requires the Intel compiler:
        [kha@lhcb-phi event-sort]$ icpc -V
        Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 20150407
        Copyright (C) 1985-2015 Intel Corporation. All rights reserved.
    or Intel’s version of the gcc compiler for cross-compilation on Xeon Phi:
        [kha@lhcb-phi event-sort]$ /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++ -v
        Using built-in specs.
        COLLECT_GCC=/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++
        COLLECT_LTO_WRAPPER=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/libexec/k1om-mpss-linux/gcc/k1om-mpss-linux/4.7.0/lto-wrapper
        Target: k1om-mpss-linux
        Configured with: /sandbox/build/tmp/tmp/work/x86_64-nativesdk-mpsssdk-linux/gcc-cross-canadian-k1om-4.7.0+mpss3.5.1-1/gcc-4.7.0+mpss3.5.1/configure --build=x86_64-linux --host=x86_64-mpsssdk-linux --target=k1om-mpss-linux --prefix=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr --exec_prefix=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr --bindir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux --sbindir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux --libexecdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/libexec/k1om-mpss-linux --datadir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share --sysconfdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/etc --sharedstatedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/com --localstatedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/var --libdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/lib/k1om-mpss-linux --includedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/include --oldincludedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/include --infodir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share/info --mandir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share/man --disable-silent-rules --disable-dependency-tracking --with-libtool-sysroot=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --with-gnu-ld --enable-shared --enable-languages=c,c++ --enable-threads=posix --disable-multilib --enable-c99 --enable-long-long --enable-symvers=gnu --enable-libstdcxx-pch --program-prefix=k1om-mpss-linux- --enable-target-optspace --enable-lto --enable-libssp --disable-bootstrap --disable-libgomp --disable-libmudflap --with-system-zlib --with-linker-hash-style=gnu --enable-cheaders=c_global --with-local-prefix=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux/usr --with-gxx-include-dir=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux/usr/include/c++ --with-build-time-tools=/sandbox/build/tmp/tmp/sysroots/x86_64-linux/usr/k1om-mpss-linux/bin --with-sysroot=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux --with-build-sysroot=/sandbox/build/tmp/tmp/sysroots/knightscorner --disable-libunwind-exceptions --disable-libssp --disable-libgomp --disable-libmudflap --with-mpfr=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --with-mpc=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --enable-nls --enable-__cxa_atexit
        Thread model: posix
        gcc version 4.7.0 20110509 (experimental) (GCC)
    C REPRODUCING THE EVENT-SORT RESULTS
    C.1 SOURCE CODE AND SETUP
    The source code is available on GitHub: https://github.com/mathemage/xphi-lhcb
        $ git clone git@github.com:mathemage/xphi-lhcb.git
        Cloning into 'xphi-lhcb'...
        ...
        $ cd xphi-lhcb/
    Then source the CERN setup script for Intel tools:
        source /afs/cern.ch/sw/IntelSoftware/linux/all-setup.sh
    To enable OpenMP, find the libiomp5.so file:
        $ find /afs/cern.ch/sw/IntelSoftware/linux/x86_64/ -name libiomp5.so
        /afs/cern.ch/sw/IntelSoftware/linux/x86_64/cce/10.1.008/lib/libiomp5.so
  • 26.
        /afs/cern.ch/sw/IntelSoftware/linux/x86_64/Compiler/11.1/059/lib/ia32/libiomp5.so
        /afs/cern.ch/sw/IntelSoftware/linux/x86_64/Compiler/11.1/059/lib/intel64/libiomp5.so
        ...
    ...and copy it into the xphi-lhcb/lib/ folder.
    Note: the instructions below were carried out for, and are valid for, the commit:
        commit ae7bc6ff540fbbdc0c1b09382f5e821e0c40e6dc
        Author: Karel Ha <mathemage@gmail.com>
        Date:   Thu Oct 8 13:17:58 2015 +0200
            Change location of libiomp5.so
    (The output produced by later versions of the repository may differ.)
    C.2 OFFLOAD-BANDWIDTH
    Change to the directory xphi-lhcb/src/offload-bandwidth/ and launch the program once for each MIC card (i.e. 4 processes in our case):
        [kha@lhcb-phi offload-bandwidth]$ ./run-on-all-MICs.sh
        icpc -lrt main.cpp -o offload-bandwidth.exe
        Launching offload-bandwith on MIC 0...
        Launching offload-bandwith on MIC 1...
        Launching offload-bandwith on MIC 2...
        Launching offload-bandwith on MIC 3...
    After a while, when all processes finish, you may check the output in the following way...
        [kha@lhcb-phi offload-bandwidth]$ cat *.out
        Using MIC0...
        Transferred: 90 GB
        Total time: 13.1119 secs
        Bandwidth: 6.864 GBps
        Using MIC1...
        Transferred: 90 GB
        Total time: 13.5207 secs
        Bandwidth: 6.65647 GBps
        Using MIC2...
        Transferred: 90 GB
        Total time: 13.1548 secs
        Bandwidth: 6.84162 GBps
        Using MIC3...
        Transferred: 90 GB
        Total time: 25.9486 secs
        Bandwidth: 3.4684 GBps
    C.3 PREFIX-OFFSET
    Change to the directory xphi-lhcb/src/prefix-offset/ and run the script:
        [kha@lhcb-phi prefix-offset]$ ./upload-to-MIC.sh
        icpc -lrt -I../../include -openmp -std=c++14 -mmic main.cpp ../utils.cpp ../prefix-sum.cpp -o mic-prefix-offset.exe
        mic-prefix-offset.exe 100% 64KB 64.4KB/s 00:00
        libiomp5.so 100% 1268KB 1.2MB/s 00:00
        Generated random lengths:
        Too many numbers to display!
        Offsets:
        Too many numbers to display!
        Total elements: 200000000
        Total time: 2.57888 secs
        Processed: 7.75531e+07 elements per second
        Processed: 0 GBps
    C.4 EVENT-SORT
    Change to the directory xphi-lhcb/src/event-sort/ and run the script:
  • 27.
        [kha@lhcb-phi event-sort]$ ./upload-to-MIC.sh
        Using MIC0...
        icpc -g -lrt -I../../include -openmp -std=c++14 -qopt-report3 -qopt-report-phase=vec -mmic main.cpp ../prefix-sum.cpp ../utils.cpp -o event-sort.mic.exe
        icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
        event-sort.mic.exe 100% 143KB 142.6KB/s 00:00
        benchmarks.sh 100% 898 0.9KB/s 00:00
        libiomp5.so 100% 1268KB 1.2MB/s 00:00
        --------STATISTICS OF TIME INTERVALS--------
        The initial iteration: 0.47684 secs
        min: 0.15831 secs
        max: 0.15947 secs
        mean: 0.15889 secs
        Histogram:
        [0.15831, 0.15860): 2 times
        [0.15860, 0.15889): 4 times
        [0.15889, 0.15918): 2 times
        [0.15918, 0.15947): 2 times
        --------------------------------------------
        ----------SUMMARY----------
        Total elements: 1e+08
        Time for computing read_offsets: 0.159042 secs
        Time for computing write_offsets: 0.288448 secs
        Time for copying: 1.14138 secs
        Total time: 1.58887 secs
        Total size: 11.5004 GB
        Processed: 6.29379e+07 elements per second
        Throughput: 7.23812 GBps
        ---------------------------
    This script cross-compiles the source code for the Intel Xeon Phi architecture and uploads the binaries and the required libraries using scp. On the MIC, the binary is called with the default parameter settings.
    You can also run several benchmark tests that vary the number of sources, the MEP factor and the number of iterations:
        [kha@lhcb-phi event-sort]$ ./upload-to-MIC.sh -b
        Running benchmarks.sh
        Using MIC0...
        icpc -g -lrt -I../../include -openmp -std=c++14 -qopt-report3 -qopt-report-phase=vec -mmic main.cpp ../prefix-sum.cpp ../utils.cpp -o event-sort.mic.exe
        icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
        event-sort.mic.exe 100% 143KB 142.6KB/s 00:00
        benchmarks.sh 100% 898 0.9KB/s 00:00
        libiomp5.so 100% 1268KB 1.2MB/s 00:00
        Varying the number of sources and the MEP factor...
        ./event-sort.mic.exe -s 1 -m 10000000
        ...
        Varying the number of iterations...
        ...