This is the final report for my project as a Technical Student at CERN.
March 26, 2016
Real-time applications on Intel Xeon/Phi
Karel Ha
CERN High Throughput Computing collaboration
Summary:
The Intel Xeon/Phi platform is a powerful x86 multi-core engine with a very high-speed
memory interface. In its next version it will be able to operate as a stand-alone system with a
very high-speed interconnect. This makes it a very interesting candidate for (near) real-time
applications such as event-building, event-sorting and event preparation for subsequent
processing by high level trigger software algorithms.
Abstract
The following document is a report providing the first results on the performance of the Intel Xeon Phi computing accelerator in the context of the LHCb Online Data Acquisition (DAQ) system.
The main focus is put on the event-sorting task: when data arrive from different sources corresponding to different parts of the LHCb detector, they are grouped by the source from which they originate. In the next stage of the DAQ, a decision must be made whether to store the given collision event or not. For this purpose, it is more convenient to group the data by the collision events they belong to (i.e. all data from one collision need to be placed together), so that the DAQ system can decide based on the “whole picture” of one event.
The Xeon Phi is an interesting candidate for the event-sorting task: it offers a large number of cores and a vast amount of memory. Furthermore, the task parallelizes very well, which makes it especially suitable for the many-core architecture of the Xeon Phi. This report may thus be used to study the feasibility of the Intel Xeon Phi platform for the next upgrade of the LHCb detector in 2018-2019.
1 INTRODUCTION
Intel Xeon Phi, or Intel Many Integrated Core Architecture (MIC), is a promising x86 many-core computing accelerator. As such, it is suitable for highly parallelizable jobs such as event-sorting, a subtask of the LHCb Data Acquisition System (DAQ). In this report, we present our measurements of event-sorting on an Intel Xeon Phi card, specifically the “Knights Corner” (KNC) version.
There are 3 demo programs:
• offload-bandwidth
• prefix-offset
• event-sort
The first two serve as preliminary tools for baseline benchmarks and for testing the properties of the Xeon Phi, whereas the last one simulates the real conditions of event-sorting in the LHCb DAQ.
For details on the software and hardware used, consult Appendix C, which also contains the instructions for reproducing the results.
There is also a shared CERNBox folder htcc_shared, which contains all the logs that I regularly kept during my internship. For full details (source code, bash and gnuplot scripts, figures, raw output files, results etc.), request access to the shared folder and consult my logs.
1.1 DESCRIPTION OF THE PROBLEM
The LHCb detector at CERN is a complex instrument consisting of many subdetectors. Hence, there are also many (approximately 1000) sources of input channels for the DAQ system. Each of the readout boards keeps its fragments of information (so-called MEP fragments, or mep_contents in the source code) in its own buffer. The fragments come from different channels and different collisions. The number of collisions is called the MEP factor (by default 10000 fragments per source).
For further processing, however, it is much more favorable to re-arrange (transpose) the
fragments and group them together according to the collision they belong to:
FIGURE 1: TRANSPOSE OF FRAGMENTS
For better illustration, see the example below:
----------Input MEP contents----------
Source #0    111222333334444
Source #1    555566667777788888
Source #2    9999aaaaabbbcc
--------------------------------------
----------Output MEP contents---------
Collision #0 11155559999
Collision #1 2226666aaaaa
Collision #2 3333377777bbb
Collision #3 444488888cc
--------------------------------------
In the “Input MEP contents”, source #0 stores 3 bytes from collision #0 (labeled by character “1”), 3 bytes from collision #1 (labeled by character “2”), 5 bytes from collision #2 (labeled by character “3”) and 4 bytes from collision #3 (labeled by character “4”).
Source #1 (corresponding to a different subdetector) stores 4 bytes from collision #0 (labeled by character “5”), followed by the data from collisions #1 to #3. Source #2 also stores 4 bytes from collision #0 (labeled by character “9”), and likewise for the remaining collisions.
At this point, the transposition re-shuffles the data so that all the information from one collision is placed together. Therefore, in the “Output MEP contents”, the buffer for collision #0 contains the previously mentioned 3 bytes from source #0 (labeled by character “1”), 4 bytes from source #1 (labeled by character “5”) and 4 bytes from source #2 (labeled by character “9”).
Here is another example of the transposition:
----------Input MEP contents----------
Source #0    11111222333334444
Source #1    5566667777788888
Source #2    99aaaaabbbcc
--------------------------------------
----------Output MEP contents---------
Collision #0 111115599
Collision #1 2226666aaaaa
Collision #2 3333377777bbb
Collision #3 444488888cc
--------------------------------------
The lengths of MEP fragments (usually between 80 and 120 bytes per fragment) are represented as 16-bit integers and stored in a separate array. The reason is performance: more than one length value fits into a cache line, so several fragment lengths can be read and processed with one cache load.
The buffers for MEP fragments are stored in an array of arrays: there is one array mep_contents[i] for each source #i. A contiguous block of memory is allocated for every such buffer mep_contents[i]. However, two consecutive buffers do not necessarily lie in one contiguous block of memory.
The output array is saved in one contiguous block of memory. It stores the “re-shuffled” copies of fragments, now grouped by collisions into collision blocks. Furthermore, the collision blocks are concatenated according to the collision index. For instance, the second example above would produce this output array:
111115599 2226666aaaaa 3333377777bbb 444488888cc
The spaces are added for clarity, in order to separate different collisions.
1.2 ALGORITHM
In order to copy the data for the transposition (for each fragment of each source), two types of array offsets (represented as 32-bit integers) need to be computed:
• read_offsets[] is the array of offsets determining where to copy from. It is the number of bytes from the beginning of mep_contents[i], where i is the source the fragment belongs to.
• write_offsets[] is the array of offsets determining where to copy to. It is the number of bytes from the beginning of the output array.
Offsets are computed by applying a prefix sum to the appropriate elements of the array of lengths. The prefix sum is the following problem: given an array of numbers a[], produce an array s[] of the same size, where s[0] = 0 and s[i] = a[0] + a[1] + ... + a[i-1] for i > 0. The prefix-sum problem is the core part of event-sorting.
Since the prefix sum for read_offsets[] within one source buffer is independent of the computations in other source buffers, we may parallelize it using #pragma omp parallel for. Similarly, the prefix sum for write_offsets[] can also be parallelized using #pragma omp parallel for (for details, see the function get_write_offsets_OMP_version() in prefix-sum.cpp).
After read_offsets and write_offsets are computed, the content of each fragment can be copied using the standard memcpy() function. The copies of individual MEP fragments are independent of one another and can therefore run in parallel: #pragma omp parallel for has been used to parallelize the loop that iterates over all MEP fragments and performs the memcopies.
1.3 THE GOAL
The goal of the demos is to test the speed and the feasibility of the Xeon Phi for event-sorting.
Possible performance improvements are studied, namely various parallelization techniques.
2 OFFLOAD-BANDWIDTH
This program measures the bandwidth between the host and the device using the #pragma offload directive:
a) offloading only to the device:
$ make && ./offload-bandwidth.exe -i 20 -e 1500000000
icpc -lrt main.cpp -o offload-bandwidth.exe
Using MIC0 ...
Transferred: 30 GB
Total time: 4.37726 secs
Bandwidth: 6.8536 GBps
b) offloading to the device and copying the result back:
$ make && ./offload-bandwidth.exe -i 20 -e 1500000000
icpc -lrt main.cpp -o offload-bandwidth.exe
Using MIC0 ...
Transferred: 60 GB
Total time: 8.67822 secs
Bandwidth: 6.91386 GBps
This bandwidth corresponds to the speed of the 50 Gbit/s PCIe interface between the host and the device. Here, the host machine is lhcb-phi.cern.ch (see Appendix A). The speed remains the same even when offload-bandwidth is launched on all 4 Xeon Phi cards at the same time (as 4 concurrent processes). This means there are four 50 Gbit/s PCIe interfaces and each of them can be fully saturated during offloads.
For more details, consult the README at https://github.com/mathemage/xphi-lhcb/tree/master/src/offload-bandwidth#parallel-run-on-all-available-mics
3 PREFIX-OFFSET
This program implements and tests the speed of prefix sum calculation.
a) 1000 iterations for an array size of 40000000; short int numbers a[i] range from 0 to 100:
Total time: 521.639 secs
Processed: 7.66814e+07 elements per second
b) 100000 iterations for an array size of 40000000; short int numbers a[i] range from 0 to 65534:
Total elements: 6000000000
Total time: 77.8086 secs
Processed: 7.71123e+07 elements per second
This is the result from 1 KNC card with lhcb−phi.cern.ch as the host (see Appendix A).
For more details, see the README at https://github.com/mathemage/xphi-lhcb/tree/master/src/prefix-offset#output
4 EVENT-SORT
LHCb Online owns 4 Intel Xeon Phi “KNC” cards. They are available on the lhcb-phi.cern.ch machine (see Appendix A).
4.1 THE DISTRIBUTION OF ITERATION DURATIONS
The simulation is iterated many times to average out statistical fluctuations. The number of iterations is controlled via the command-line argument -i.
a) The results for 200 iterations:
# ./event-sort.mic.exe -i 200
...
----------SUMMARY----------
Total elements: 2e+09
Time for computing read_offsets: 0.553636 secs
Time for computing write_offsets: 2.50423 secs
Time for copying: 17.4631 secs
Total time: 20.521 secs
Total size: 230.013 GB
Processed: 9.74612e+07 elements per second
Throughput: 11.2087 GBps
---------------------------
Time for computing read_offsets is the total time spent calculating prefix sums for read_offsets[], time for computing write_offsets is the total time spent calculating prefix sums for write_offsets[], and time for copying is the total time of performing memcpy() of MEP fragments.
b) The results and the histogram for 1000 iterations:
--------STATISTICS OF TIME INTERVALS (in secs)------------
The initial iteration: 0.43506
min:  0.10139
max:  0.10303
mean: 0.10216
...
--------------------------------------------
--------STATISTICS OF THROUGHPUTS (in GBps)---------------
min:  11.16119
max:  11.34263
mean: 11.25702
...
--------------------------------------------
----------SUMMARY----------
Total elements: 1e+10
Time for computing read_offsets: 3.14013 secs
Time for computing write_offsets: 12.2161 secs
Time for copying: 86.8014 secs
Total time: 102.158 secs
Total size: 1149.98 GB
Processed: 9.7888e+07 elements per second
Throughput: 11.2569 GBps
---------------------------
The histograms of the previous measurements:
4.2 COMPARISON BETWEEN EVENT-SORT AND RAW MEMCPY
The program memcpy-bandwidth tests only the throughput of the memcpy() function on the Intel Xeon Phi. It copies chunks (arrays) of data from one place to another (with OpenMP parallelization). This process is iterated (50 times in the case below) and the final throughput is calculated.
The number of threads is varied using #pragma omp parallel for num_threads(). The corre-
sponding plot is in Figure 2.
FIGURE 2: EVENT-SORT COMPARED TO RAW MEMCPY(), WITH VARIABLE NUMBER OF
THREADS
4.3 BLOCKSCHEMES FOR MEMCPY
The memory access patterns for event-sort can be optimized by splitting the workload into
blocks or blockschemes of fragments. The serial version of event-sort would process frag-
ments as shown in Figure 3. Each circle represents one MEP fragment, indexed by its source
and its event.
FIGURE 3: WITHOUT A BLOCKSCHEME
The previously mentioned parallelized event-sort would assign each circle to a single worker-thread. Since fragment sizes are typically 80-120 B, the memcpy is inefficient: the core caches are much larger and thus not fully used.
By assigning a whole block of the workload to every worker-thread, we reduce cache thrashing. There are 4 blocks of 2x2 size in the blockscheme of Figure 4, which would be processed by 4 worker-threads in parallel.
FIGURE 4: 2X2 BLOCKS
Moreover, the spatial locality of data can also play an important role: fragments in the rows of the picture are stored in a contiguous block of memory. Thus, the blocks load from and store into only contiguous parts of memory.
The algorithm is given the block dimensions (in the figure: 2 sources per block, 2 events per block). The blocks are then distributed among worker-threads (by an OpenMP parallel for loop). Within every block, the assigned worker performs a memcpy for each fragment, using the previously computed read_offsets[] and write_offsets[].
In order to find the optimal block dimensions, a series of benchmark tests has been carried out. The results are represented in the following heatmap:
FIGURE 5: EVENT-SORT WITH VARIOUS PARAMETERS OF BLOCKSCHEME (KNC)
The event-sort with the optimal block dimensions (according to the heatmap on the right side):
# ./upload-to-MIC.sh -i 100 -1 5 -2 28
...
____________________________SUMMARY____________________________
Total elements: 1e+09
Time for computing read_offsets: 0.28435 secs
Time for computing write_offsets: 1.13954 secs
Time for copying: 3.1574 secs
Total time: 4.58129 secs
Total size: 114.998 GB
Processed: 2.18279e+08 elements per second
Throughput: 25.1016 GBps
_______________________________________________________________
Comparing the times, about 69% of the total time is spent doing memcopies; the rest is the computation of offsets. Moreover, the overall throughput has been improved by a factor of more than 2 (compare with Section 4.1).
4.4 ASLR ON KNC AND ITS EFFECT ON EVENT-SORT
Address Space Layout Randomization (ASLR) was suspected to cause great inconsistency in results on the KNL Xeon Phi. This was pointed out by Wim Heirman in the following e-mail conversation:
Hi Karel,
I did some more runs, now with Linux address randomization turned on (my
machine had it disabled previously). I do see some large variations now. Do you
have address randomization turned on for your machine? (see output of "sysctl
kernel.randomize_va_space", 0 means disabled while 1 and 2 enable different
parts of it). Can you do a few more runs with a disabled setting? (See [1], I
think the setarch -R option should work even if you don't have root access).
Regards,
Wim
[1] http://stackoverflow.com/questions/11238457/disable-and-re-enable-address-space-layout-randomization-only-for-myself
I have tried my application on the KNCs with various settings of ASLR. There were 100 experiments (runs), each performing only 1 iteration.
For kernel.randomize_va_space = 0:
mean = 20.0434 min = 19.6947 max = 20.4567 standard deviation = 0.1267
For kernel.randomize_va_space = 1:
mean = 20.3565 min = 19.5846 max = 21.1473 standard deviation = 0.3669
For kernel.randomize_va_space = 2:
mean = 20.305 min = 19.555 max = 21.1037 standard deviation = 0.3641
In conclusion, it seems ASLR does have some effect on variation.
4.5 FIXATION OF INPUT DATA
Rainer and I had a hypothesis that the throughput of event-sort may be highly dependent on the input data (e.g. whether the fragment lengths fit cache lines). In order to test this idea, I have implemented an option --srand-seed. It sets a custom seed for the srand() function, which is used for randomizing the input data. Hence, by initializing to a chosen custom seed, the input is always the same between different runs.
For the range of seeds from 0 to 100, I have studied the variability (mean, standard deviation, min, max) of the resulting throughputs. A screenshot of the results is to be found in Figure 6. The mean, the (sample-based) standard deviation, the min and the max are always taken from 10 runs, each initializing srand() to the same seed (the one in the first column). Blue and red cells are the min and max, respectively, of the values in the corresponding column.
For comparison, here is an entirely serial version (i.e. copy_MEPs_serial_version()) with two chosen seeds:
• srand-seed == 83:
mean = 0.111149 standard deviation = 4.47532e-05 min = 0.111081 max = 0.111204
mean = 0.111167 standard deviation = 8.98804e-05 min = 0.111082 max = 0.111397
mean = 0.11108 standard deviation = 0.000120816 min = 0.110984 max = 0.111401
• srand-seed == 89:
mean = 0.111119 standard deviation = 5.10757e-05 min = 0.11104 max = 0.111186
mean = 0.111151 standard deviation = 5.33504e-05 min = 0.111079 max = 0.111227
mean = 0.111093 standard deviation = 0.000144087 min = 0.110992 max = 0.111487
There was no OpenMP in the copying part, but there are still two OpenMP-parallelized functions in the offset computation. That is why the deviation is not exactly 0.
The conclusion is: even though the deviation is negligible in size, it is still clearly nonzero. This suggests that the variation has another cause, possibly the non-determinism of thread scheduling.
FIGURE 6: EVENT-SORT (IN GBYTES/S) ON KNC FOR VARIOUS ASLR SETTINGS AND VARIOUS FIXED INPUT DATA (DEPENDENT ON THE SEED)
4.6 VARYING THE NUMBER OF COPY-THREADS
Another idea is to fix the input data and vary the number of threads performing the copying part. This is done by OpenMP here:
void copy_MEPs_block_scheme() {
  ...
  #pragma omp parallel for num_threads(nthreads)
  ...
}
Figure 7 shows the dependency of the (sample-based) standard deviation on the number of copying threads. The deviation is taken over 10 experiments (runs). The tested numbers of copy-threads are 1, 2, 4, 8, 16, 32 and 64.
Figure 8 shows the identical experiment for all numbers of copy-threads from 1 to 64. From the latter figure, it seems there is no apparent dependency between the number of copy-threads and the standard deviation of the runs.
FIGURE 7: NUMBER OF (COPYING) THREADS VS. STANDARD DEVIATION OF RESULTS
(1, 2, 4, 8, 16, 32, 64 THREADS)
FIGURE 8: NUMBER OF (COPYING) THREADS VS. STANDARD DEVIATION OF RESULTS
(1, 2, 3, 4, · · · , 64 THREADS)
5 SOME IDEAS FOR FUTURE WORK
• “Recompile” the event-sort project using the ispc compiler: https://ispc.github.io/. This compiler has promising auto-vectorization capabilities.
• Write unit tests for the project, for instance using the Google Test framework: https://github.com/google/googletest
• Use CMake instead of hand-written Makefiles: https://cmake.org/
• Consider (try, test and benchmark) the usage of Intel TBB for the prefix-sum functions: https://www.threadingbuildingblocks.org/
• Consider (try, test and benchmark) the usage of OpenCL for the prefix-sum functions: https://www.khronos.org/opencl/
• Run the high_performance_linpack_benchmark on the Xeon Phi: https://lbdokuwiki.cern.ch/doku.php?id=upgrade:high_performance_linpack_benchmark
• Participate in the CERN Concurrency Forum: http://concurrency.web.cern.ch/
6 CONCLUSION
The simulations of the event-sorting task show that the KNC is capable of delivering a throughput of about 25 GB/s. Our aim was to reach 12.5 GB/s, so as to saturate a 100 Gbit/s Ethernet network, which is one of the candidate networks for the LHCb upgrade.
This has been accomplished by splitting the workload into blocks of fragments and letting the threads memcopy whole blocks of fragments rather than doing it fragment by fragment.
The excess throughput can be exploited as additional computing power. For example, some portion of the Xeon Phi cards (cores, number of threads) can be allocated to event-sorting (just enough for 12.5 GB/s), whereas the remaining capacity may be used for other algorithms, so as to start the reconstruction process already at this very early stage. Thus, the overall quality of the decisions whether to store or discard events would improve.
A INFRASTRUCTURE
The LHCb Online group provides the server machine lhcb-phi.cern.ch. This host machine contains 32 logical Intel(R) Xeon(R) 2.00 GHz processors:
[kha@lhcb-phi kha]$ less /proc/cpuinfo | tail -n 26
processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping        : 7
microcode       : 1808
cpu MHz         : 1200.000
cache size      : 20480 KB
physical id     : 1
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 47
initial apicid  : 47
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4014.16
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
with the operating system:
[kha@lhcb-phi kha]$ uname -a
Linux lhcb-phi 2.6.32-504.el6.x86_64 #1 SMP Tue Oct 14 11:22:00 CDT 2014 x86_64 x86_64 x86_64 GNU/Linux
On top of that, there are also 4 Intel KNC Xeon Phi cards (the so-called “devices”, here MIC0, MIC1, MIC2 and MIC3). They are connected to the host via 50 Gbit/s PCIe lanes and each of them has 228 logical processors:
[xeonphi@lhcb-phi-mic0 ~]$ less /proc/cpuinfo | tail -n 26
processor       : 31
vendor_id       : GenuineIntel
cpu family      : 11
model           : 1
model name      : 0b/01
stepping        : 3
cpu MHz         : 1100.000
cache size      : 512 KB
physical id     : 0
siblings        : 228
core id         : 56
cpu cores       : 57
apicid          : 227
initial apicid  : 227
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm
bogomips        : 2205.22
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
each with the operating system:
[kha@lhcb-phi kha]$ uname -a
Linux lhcb-phi 2.6.32-504.el6.x86_64 #1 SMP Tue Oct 14 11:22:00 CDT 2014 x86_64 x86_64 x86_64 GNU/Linux
B COMPILERS
The source code is written in C++ and uses OpenMP for task-based parallelization. It requires the Intel compiler:
[kha@lhcb-phi event-sort]$ icpc -V
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 20150407
Copyright (C) 1985-2015 Intel Corporation. All rights reserved.
or Intel’s version of the gcc compiler for cross-compilation for the Xeon Phi:
[kha@lhcb-phi event-sort]$ /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++ -v
Using built-in specs.
COLLECT_GCC=/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++
COLLECT_LTO_WRAPPER=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/libexec/k1om-mpss-linux/gcc/k1om-mpss-linux/4.7.0/lto-wrapper
Target: k1om-mpss-linux
Configured with: /sandbox/build/tmp/tmp/work/x86_64-nativesdk-mpsssdk-linux/gcc-cross-canadian-k1om-4.7.0+mpss3.5.1-1/gcc-4.7.0+mpss3.5.1/configure --build=x86_64-linux --host=x86_64-mpsssdk-linux --target=k1om-mpss-linux --prefix=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr --exec_prefix=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr --bindir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux --sbindir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux --libexecdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/libexec/k1om-mpss-linux --datadir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share --sysconfdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/etc --sharedstatedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/com --localstatedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/var --libdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/lib/k1om-mpss-linux --includedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/include --oldincludedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/include --infodir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share/info --mandir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share/man --disable-silent-rules --disable-dependency-tracking --with-libtool-sysroot=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --with-gnu-ld --enable-shared --enable-languages=c,c++ --enable-threads=posix --disable-multilib --enable-c99 --enable-long-long --enable-symvers=gnu --enable-libstdcxx-pch --program-prefix=k1om-mpss-linux- --enable-target-optspace --enable-lto --enable-libssp --disable-bootstrap --disable-libgomp --disable-libmudflap --with-system-zlib --with-linker-hash-style=gnu --enable-cheaders=c_global --with-local-prefix=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux/usr --with-gxx-include-dir=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux/usr/include/c++ --with-build-time-tools=/sandbox/build/tmp/tmp/sysroots/x86_64-linux/usr/k1om-mpss-linux/bin --with-sysroot=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux --with-build-sysroot=/sandbox/build/tmp/tmp/sysroots/knightscorner --disable-libunwind-exceptions --disable-libssp --disable-libgomp --disable-libmudflap --with-mpfr=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --with-mpc=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --enable-nls --enable-__cxa_atexit
Thread model: posix
gcc version 4.7.0 20110509 (experimental) (GCC)
C REPRODUCING THE EVENT-SORT RESULTS
C.1 SOURCE CODE AND SETUP
The source code is available on GitHub: https://github.com/mathemage/xphi-lhcb
$ git clone git@github.com:mathemage/xphi-lhcb.git
Cloning into 'xphi-lhcb'...
...
$ cd xphi-lhcb/
Then source the CERN setup script for Intel tools:
source /afs/cern.ch/sw/IntelSoftware/linux/all-setup.sh
To enable OpenMP, find the libiomp5.so file:
$ find /afs/cern.ch/sw/IntelSoftware/linux/x86_64/ -name libiomp5.so
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/cce/10.1.008/lib/libiomp5.so
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/Compiler/11.1/059/lib/ia32/libiomp5.so
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/Compiler/11.1/059/lib/intel64/libiomp5.so
...
...and copy it into the xphi-lhcb/lib/ folder.
Note: the instructions below were done and are valid for the commit:
commit ae7bc6ff540fbbdc0c1b09382f5e821e0c40e6dc
Author: Karel Ha <mathemage@gmail.com>
Date:   Thu Oct 8 13:17:58 2015 +0200

    Change location of libiomp5.so
(The output produced by later versions of the repository may differ.)
C.2 OFFLOAD-BANDWIDTH
Change to the directory xphi-lhcb/src/offload-bandwidth/ and launch the program once for each MIC card (i.e. 4 processes in our case):
[kha@lhcb-phi offload-bandwidth]$ ./run-on-all-MICs.sh
icpc -lrt main.cpp -o offload-bandwidth.exe
Launching offload-bandwith on MIC 0 ...
Launching offload-bandwith on MIC 1 ...
Launching offload-bandwith on MIC 2 ...
Launching offload-bandwith on MIC 3 ...
After a while, when all processes finish, you may check the output in the following way:
[kha@lhcb-phi offload-bandwidth]$ cat *.out
Using MIC0 ...
Transferred: 90 GB
Total time: 13.1119 secs
Bandwidth: 6.864 GBps
Using MIC1 ...
Transferred: 90 GB
Total time: 13.5207 secs
Bandwidth: 6.65647 GBps
Using MIC2 ...
Transferred: 90 GB
Total time: 13.1548 secs
Bandwidth: 6.84162 GBps
Using MIC3 ...
Transferred: 90 GB
Total time: 25.9486 secs
Bandwidth: 3.4684 GBps
C.3 PREFIX-OFFSET
Change to the directory xphi-lhcb/src/prefix-offset/ and run the script:
[kha@lhcb-phi prefix-offset]$ ./upload-to-MIC.sh
icpc -lrt -I../../include -openmp -std=c++14 -mmic main.cpp ../utils.cpp ../prefix-sum.cpp -o mic-prefix-offset.exe
mic-prefix-offset.exe    100%   64KB  64.4KB/s   00:00
libiomp5.so              100% 1268KB   1.2MB/s   00:00
Generated random lengths:
Too many numbers to display!
Offsets:
Too many numbers to display!
Total elements: 200000000
Total time: 2.57888 secs
Processed: 7.75531e+07 elements per second
Processed: 0 GBps
C.4 EVENT-SORT
Change to the directory xphi-lhcb/src/event-sort/ and run the script:
[kha@lhcb-phi event-sort]$ ./upload-to-MIC.sh
Using MIC0 ...
icpc -g -lrt -I../../include -openmp -std=c++14 -qopt-report3 -qopt-report-phase=vec -mmic main.cpp ../prefix-sum.cpp ../utils.cpp -o event-sort.mic.exe
icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
event-sort.mic.exe    100%  143KB 142.6KB/s   00:00
benchmarks.sh         100%  898     0.9KB/s   00:00
libiomp5.so           100% 1268KB   1.2MB/s   00:00
--------STATISTICS OF TIME INTERVALS--------
The initial iteration: 0.47684 secs
min:  0.15831 secs
max:  0.15947 secs
mean: 0.15889 secs
Histogram:
[0.15831, 0.15860): 2 times
[0.15860, 0.15889): 4 times
[0.15889, 0.15918): 2 times
[0.15918, 0.15947): 2 times
--------------------------------------------
----------SUMMARY----------
Total elements: 1e+08
Time for computing read_offsets: 0.159042 secs
Time for computing write_offsets: 0.288448 secs
Time for copying: 1.14138 secs
Total time: 1.58887 secs
Total size: 11.5004 GB
Processed: 6.29379e+07 elements per second
Throughput: 7.23812 GBps
---------------------------
This script cross-compiles the source code for the Intel Xeon Phi architecture and uploads the binaries and required libraries using scp. On the MIC, the binary is called with the default settings of the parameters.
You can also run several benchmark tests, varying the number of sources, the MEP factor and the number of iterations:
[kha@lhcb-phi event-sort]$ ./upload-to-MIC.sh -b
Running benchmarks.sh
Using MIC0 ...
icpc -g -lrt -I../../include -openmp -std=c++14 -qopt-report3 -qopt-report-phase=vec -mmic main.cpp ../prefix-sum.cpp ../utils.cpp -o event-sort.mic.exe
icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
event-sort.mic.exe    100%  143KB 142.6KB/s   00:00
benchmarks.sh         100%  898     0.9KB/s   00:00
libiomp5.so           100% 1268KB   1.2MB/s   00:00
Varying the number of sources and the MEP factor ...
./event-sort.mic.exe -s 1 -m 10000000
...
Varying the number of iterations ...
...