An eternal question of timing
Author: Andrey Karpov
Date: 30.03.2011
It seemed that the long forum debates about how to measure an algorithm's running time, which functions to use, and what precision to expect were over. Unfortunately, we have to return to this question once again. Today we will discuss how to measure the speed of a parallel algorithm.
I want to say right away that I will not give you a concrete recipe. I have faced the issue of measuring the speed of parallel algorithms only recently myself, so I am not an expert in this question. This post is therefore more of a research article. I would appreciate it if you shared your opinions and recommendations with me. I think together we will manage the problem and work out an optimal solution.
The task is to measure the running time of a fragment of user code. Earlier, I would have used the following class to solve this task:
class Timing {
public:
  void StartTiming();
  void StopTiming();
  double GetUserSeconds() const {
    return double(m_userTime) / 10000000.0;
  }
private:
  __int64 GetUserTime() const;
  __int64 m_userTime;
};
__int64 Timing::GetUserTime() const {
  FILETIME creationTime;
  FILETIME exitTime;
  FILETIME kernelTime;
  FILETIME userTime;
  GetThreadTimes(GetCurrentThread(),
                 &creationTime, &exitTime,
                 &kernelTime, &userTime);
  __int64 curTime;
  curTime = userTime.dwHighDateTime;
  curTime <<= 32;
  curTime += userTime.dwLowDateTime;
  return curTime;
}
void Timing::StartTiming() {
  m_userTime = GetUserTime();
}

void Timing::StopTiming() {
  __int64 curUserTime = GetUserTime();
  m_userTime = curUserTime - m_userTime;
}
This class is based on the GetThreadTimes function, which allows you to separate the running time of user code from the running time of system functions. The class is intended to estimate the running time of a thread in user mode, so we use only the returned lpUserTime parameter.
Now consider a code sample where some number is calculated. We will use the Timing class to measure
the running time.
void UseTiming1()
{
  Timing t;
  t.StartTiming();
  unsigned sum = 0;
  for (int i = 0; i < 1000000; i++)
  {
    char str[1000];
    for (int j = 0; j < 999; j++)
      str[j] = char(((i + j) % 254) + 1);
    str[999] = 0;
    for (char c = 'a'; c <= 'z'; c++)
      if (strchr(str, c) != NULL)
        sum += 1;
  }
  t.StopTiming();
  printf("sum = %u\n", sum);
  printf("%.3G seconds.\n", t.GetUserSeconds());
}
Being presented in this form, the timing mechanism behaves as expected and gives, say, 7 seconds on my machine. The result is correct even on a multi-core machine, since it does not matter which cores are used while the algorithm is running (see Figure 1).
Figure 1 - Work of one thread on a multi-core computer
Now imagine that we want to use the capabilities of multi-core processors in our program and estimate the benefit we get from parallelizing the algorithm with the OpenMP technology. Let's parallelize our code by adding one line:
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 1000000; i++)
{
  char str[1000];
  for (int j = 0; j < 999; j++)
    str[j] = char(((i + j) % 254) + 1);
  str[999] = 0;
  for (char c = 'a'; c <= 'z'; c++)
    if (strchr(str, c) != NULL)
      sum += 1;
}
The program now prints a running time of 1.6 seconds. Since we use a 4-core computer, I feel like saying "Hurray! We've got a 4-fold speed-up, and the timing confirms it."
But it is really not so good: we are not measuring the running time of the algorithm; we are measuring the running time of the main thread instead. In this case the measurement seems reliable only because the main thread was working for the same time as the secondary threads. Let's carry out a simple experiment: we will explicitly request 10 threads instead of 4:
#pragma omp parallel for reduction(+:sum) num_threads(10)
Logic says that this code should run for about the same time as the code parallelized into 4 threads. We have a four-core processor, so we should expect a larger number of threads to cause only a slowdown. Instead, we will see a result of about 0.7 seconds.
This is an expected result, although we wanted quite a different thing. We created 10 threads, each of which worked for about 0.7 seconds; and 0.7 seconds is exactly how long the main thread, whose running time the Timing class measures, was running. As you can see, this method cannot be used to measure the speed of programs with parallel code fragments. Let's make it clearer by presenting it graphically in Figure 2.
Figure 2 - This is how work of 10 threads might look on a four-core computer
Of course, we could use the time() function, but its resolution is low and it does not allow you to separate the running time of user code from that of system code. Other processes may also influence the time and significantly distort the measurement.
A favorite timing function of many developers is QueryPerformanceCounter. Let's measure the speed using this function. In a simple form, the timing class looks this way:
class Timing2 {
public:
  void StartTiming();
  void StopTiming();
  double GetUserSeconds() const {
    return value;
  }
private:
  double value;
  LARGE_INTEGER time1;
};
void Timing2::StartTiming()
{
  QueryPerformanceCounter(&time1);
}

void Timing2::StopTiming()
{
  LARGE_INTEGER performance_frequency, time2;
  QueryPerformanceFrequency(&performance_frequency);
  QueryPerformanceCounter(&time2);
  value = (double)(time2.QuadPart - time1.QuadPart);
  value /= performance_frequency.QuadPart;
}
Unfortunately, we cannot do so on a multi-core computer anymore. :) Let's read the description of this function in MSDN:
On a multiprocessor computer, it should not matter which processor is called. However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL). To specify processor affinity for a thread, use the SetThreadAffinityMask function.
Let's improve the code and tie the main thread to one core:
class Timing2 {
public:
  void StartTiming();
  void StopTiming();
  double GetUserSeconds() const {
    return value;
  }
private:
  DWORD_PTR oldmask;
  double value;
  LARGE_INTEGER time1;
};

void Timing2::StartTiming()
{
  volatile int warmingUp = 1;
  #pragma omp parallel for
  for (int i=1; i<10000000; i++)
  {
    #pragma omp atomic
    warmingUp *= i;
  }
  oldmask = SetThreadAffinityMask(::GetCurrentThread(), 1);
  QueryPerformanceCounter(&time1);
}

void Timing2::StopTiming()
{
  LARGE_INTEGER performance_frequency, time2;
  QueryPerformanceFrequency(&performance_frequency);
  QueryPerformanceCounter(&time2);
  SetThreadAffinityMask(::GetCurrentThread(), oldmask);
  value = (double)(time2.QuadPart - time1.QuadPart);
  value /= performance_frequency.QuadPart;
}
Readers might ask why we need this strange loop that does nothing. Contemporary processors reduce their frequency at low load. This loop preliminarily raises the processor's speed to the maximum and therefore slightly increases the precision of the measurement. Additionally, we are warming up all the available cores.
The timing method shown here has the same drawback: we cannot separate the running time of user code from that of system code. If other tasks are running on the core at the same time, the result may also be rather inaccurate. But it seems to me that this method, unlike GetThreadTimes, can still be applied to a parallel algorithm.
Let's measure the results of the Timing and Timing2 classes at various numbers of threads. For this purpose, the OpenMP directive num_threads(N) is used. Let's arrange the data in the table shown in Figure 3.
Figure 3 - Algorithm's running time in seconds measured with functions GetThreadTimes and
QueryPerformanceCounter on a four-core machine
As you can see, as long as the number of threads does not exceed the number of cores, the GetThreadTimes function gives a result similar to that of QueryPerformanceCounter, which may make you think the measurement is correct. But with more threads, you cannot rely on its result.
Unfortunately, the program prints varying values from launch to launch. I do not know how to make the measurement more accurate and correct. So I am waiting for your feedback and your methods for correctly timing parallel algorithms.
You may download the program text here (a project for Visual Studio 2005).