The document provides instructions for experimentally determining the average-case time complexity of a permutation algorithm. It describes running the pdriver program with different input sizes N and repetitions R to collect timing data, which is recorded in a spreadsheet. The timing data is then analyzed by calculating the time per call T and comparing it to various complexity functions f(N) to determine which best fits the observed trends in the T/f(N) values.
3.1 Analysis
Use the copy-and-paste method to give an average-case analysis
for the permute function in permute.cpp.
Assume that the random number generator used in the algorithm
produces truly random numbers, i.e., for any integer j in the
range 0..RAND_MAX, any given call of rand() has a probability
1/(RAND_MAX+1) of returning j.
The rand() function runs in O(1) time (worst-case and average).
Show all work!
Remember that you are doing average-case analysis.
If, at some point, you don’t ask how the average case behavior
differs from the worst case behavior, you’re doing it wrong!
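One general fact that often separates average-case from worst-case behavior is worth having at hand. It is offered only as an illustration and says nothing about what permute.cpp actually does. Suppose a loop repeats until a random event occurs, and each try succeeds independently with probability p. The worst case is unbounded, yet the expected number of tries K is finite:

```latex
E[K] \;=\; \sum_{k=1}^{\infty} k \, p \, (1-p)^{k-1} \;=\; \frac{1}{p},
\qquad 0 < p \le 1 .
```

Whether such a loop appears in permute, and with what value of p, is for your own analysis to establish.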
3.2 Experiment
Check your analysis by running the algorithm on a variety of
input sizes and measuring the time it takes.
The program pdriver.cpp can be used as a “driver” to execute the
permute algorithm. This program expects a pair of command line
arguments when run. The first argument is the number of items
to permute. The second is the number of times you want to
repeat the permutation process.
For example, if you compile this program to produce an
executable named “pdriver”, then
pdriver 50 10
will generate 10 permutations of 50 elements each.
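To make the meaning of the two arguments concrete, here is a minimal sketch of how a driver might read them. It is not the code in pdriver.cpp, which is not reproduced here. The names DriverArgs and parseDriverArgs are hypothetical, and the real driver may validate its input differently.

```cpp
#include <cstdlib>
#include <stdexcept>

// Hypothetical holder for the two command line values.
struct DriverArgs {
    long itemCount;    // first argument: number of items to permute (N)
    long repetitions;  // second argument: number of permutations to generate (R)
};

// Read N and R from an argv-style array such as {"pdriver", "50", "10"}.
DriverArgs parseDriverArgs(int argc, const char* const argv[]) {
    if (argc != 3) {
        throw std::invalid_argument("usage: pdriver N R");
    }
    DriverArgs args{std::strtol(argv[1], nullptr, 10),
                    std::strtol(argv[2], nullptr, 10)};
    if (args.itemCount <= 0 || args.repetitions <= 0) {
        throw std::invalid_argument("N and R must be positive integers");
    }
    return args;
}
```

Called with the arguments of "pdriver 50 10", this sketch yields itemCount 50 and repetitions 10.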
Because the purpose of this exercise is to generate timing data
rather than to actually work with the algorithm, the generated
permutations are not printed. Feel free to add output statements
if you want to see the algorithms in action, but remember to
remove those statements before proceeding on to the timing
steps. (I/O is slow and may distort your timing results.)
Compile the program.
Run the program for different sizes of N and time the algorithm.
On Linux systems, you can measure the run time of a
program (or of any Unix program/command) by placing the
command “time” in front of the program invocation. For example,
time ./pdriver 50 10
will determine the time required to generate 10 permutations of
50 elements each.
The time will appear in a format similar to this:
real 0m5.530s
user 0m3.362s
sys 0m0.090s
The last two numbers are of the most interest to us. Their sum is
the number of seconds the CPU devoted to execution of this
program. This is often much less than the “clock time” (the
first number), because the CPU is usually being shared by
several different programs.
The user time is the amount of time spent in “user” code.
The sys time is the amount of time spent in “system” calls.
Together, these constitute the “CPU time” used by the command.
The real time is the actual time that passed from when the program
started to when it ended. This can be much longer than the CPU
time.
We are interested in the CPU time because the rest of the real
time is spent running processes other than the one we are timing.
An important factor to keep in mind is that we are looking for
average times. If we ran the algorithm only a single time for a
particular value of N, depending upon how the random numbers
came out, we might get an unusually fast or an unusually slow
run. We need to do multiple repetitions for each value of N so
that we can get the average time of a single run of size N.
That’s why the main driver function of this program is
designed to allow you to request multiple repetitions of the
algorithm for a fixed N.
Another reason for doing averages over many repetitions is to
reduce the effect of measurement errors, always a possibility
in any experiment. Even computer clocks have a finite
resolution. For small values of N this algorithm might run in a
fraction of one clock “tick”. Another source of measurement
error in this kind of experiment is the fact that we can’t run just
the permute algorithm in isolation. We have to run a whole
program that launches the permute algorithm, so some of our
measured time will be due to code other than the algorithm we
are interested in.
Timings of less than 10 seconds are likely to be dominated by
the overhead of starting and stopping the program, so you
should adjust the repetition count (the second argument of the
pdriver program) to make sure that all your timed runs take at
least that many seconds. A minimum of 60 seconds is probably
a good target.
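The effect of clock resolution can be seen from inside a program as well. The sketch below is not how this assignment asks you to collect data (use the time command for that). It uses the standard std::clock function, which reports processor time in ticks of 1/CLOCKS_PER_SEC seconds, to time many repetitions and divide by the count. The name averageCpuSeconds is hypothetical, and the work argument stands in for a call to permute.

```cpp
#include <ctime>

// Run work() the given number of times and return the average CPU
// seconds per call, as measured by std::clock. A single fast call may
// take less than one clock tick, which is why many calls are timed
// together and the total is divided by the repetition count.
template <typename Work>
double averageCpuSeconds(long repetitions, Work work) {
    const std::clock_t start = std::clock();
    for (long i = 0; i < repetitions; ++i) {
        work();
    }
    const std::clock_t stop = std::clock();
    const double total =
        static_cast<double>(stop - start) / CLOCKS_PER_SEC;
    return total / static_cast<double>(repetitions);
}
```

Note that the loop itself costs a little time on every repetition, so the result still includes some overhead that is not part of the algorithm being measured.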
In a non-Unix environment, getting the timing information is
harder. You can, of course, time the program with a good
old-fashioned stopwatch, but you’ll need to take special care to be
sure that your own physical reaction time doesn’t affect the
results. In addition, you are then measuring clock time, not CPU
time. We have already discussed the dangers of using clock
time. I strongly recommend, therefore, that you do this
experiment on our Linux servers or on a Linux (or MacOS)
machine of your own, if you have one.
For each value of N that you choose to use, determine an
appropriate value of R, a number of repetitions of the permute
algorithm that brings the total time of
./pdriver N R
to a reasonable level (i.e., at least 60 seconds, but not so much
longer that you get tired of waiting for your data).
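One way to choose R without much trial and error is to scale up from a short pilot run. The sketch below assumes that total CPU time grows roughly in proportion to R for a fixed N. The function name repetitionsForTarget and its parameters are hypothetical. Because short runs include startup overhead, treat the result as a first guess and confirm it with an actual timed run.

```cpp
#include <cmath>
#include <stdexcept>

// Estimate how many repetitions are needed to reach targetSeconds of CPU
// time, given that a pilot run of pilotRepetitions took pilotCpuSeconds.
long repetitionsForTarget(double pilotCpuSeconds, long pilotRepetitions,
                          double targetSeconds) {
    if (pilotCpuSeconds <= 0.0 || pilotRepetitions <= 0 ||
        targetSeconds <= 0.0) {
        throw std::invalid_argument("all inputs must be positive");
    }
    // Scale R by the ratio of the target time to the pilot time,
    // rounding up so the target is not undershot.
    const double scaled =
        targetSeconds * static_cast<double>(pilotRepetitions) /
        pilotCpuSeconds;
    return static_cast<long>(std::ceil(scaled));
}
```

For example, if a pilot run with R = 1000 used 6 seconds of CPU time, a 60 second target suggests R = 10000.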
Record the values N, R, and TT (the total CPU time observed)
in a spreadsheet, e.g.,
N | R | TT
100 | 10000 | 60.52
200 | 5000 | 56.5
400 | 2000 | 110.2
Each value of N should occur in only one row, and the N’s
need to be presented in ascending order. The values shown here
are for illustration only; the proper selection of values for N
and R is discussed below.
What values of N should you use? Here are some things to keep
in mind:
Big-O is for “sufficiently large N”. For small values of N, the
largest term in the overall complexity might not yet be
dominating any lower-order terms. In addition, you have certain
constant and O(R) factors involved in launching the program
and looping around calling the permute function. For small
values of N, these might distort your results.
You are going to be looking for trends in the data. That requires
a fair number of points, but even more important than the total
number of data points is that those points be spread widely
apart. You should make sure that your values of N range over
many orders of magnitude. (An order of magnitude is a factor of
ten.)
It is possible to get too large. When your arrays get so large that
the virtual memory system begins swapping parts of them out of
RAM to disk, your timing results will suddenly become very
erratic. Your total array size should probably be kept well under
the amount of RAM available to you.
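The two constraints above (spread N widely, but stay within RAM) can be sketched as follows. The names inputSizes and arrayBytes are hypothetical. The memory estimate also assumes one array of int per run, which may not match what permute.cpp actually allocates.

```cpp
#include <cstddef>
#include <vector>

// Produce values of N spaced by a constant factor, e.g. a factor of 10
// gives one value per order of magnitude between smallest and largest.
std::vector<long> inputSizes(long smallest, long largest, long factor) {
    std::vector<long> sizes;
    if (smallest <= 0 || factor < 2) {
        return sizes;  // nothing sensible to generate
    }
    for (long n = smallest; n <= largest; n *= factor) {
        sizes.push_back(n);
        if (n > largest / factor) {
            break;  // the next value would exceed largest (or overflow)
        }
    }
    return sizes;
}

// Rough memory footprint, ASSUMING a single array of N ints.
std::size_t arrayBytes(long n) {
    return static_cast<std::size_t>(n) * sizeof(int);
}
```

Calling inputSizes(100, 1000000, 10) gives 100, 1000, 10000, 100000, 1000000. Comparing arrayBytes for the largest N against the RAM on your machine is a quick check that you are not about to start swapping.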
3.3 Evaluation
In this lecture we looked at a procedure for confirming a predicted
complexity. In this assignment, we will employ a slightly
expanded form of the same technique.
In your spreadsheet, make sure that your rows are sorted into
increasing values of N. Then add a 4th column in which you
compute T, the time to execute a single call to the permute
function. This should be (approximately) TT/R:
N | R | TT | T
100 | 10000 | 60.52 | .006052
200 | 5000 | 56.5 | .0113
400 | 2000 | 110.2 | .0551
Now, from this lecture, you should recall that,
if a function is O(f(N)), then T/f(N) should be roughly constant
over all N.
if f(N) is actually too large, then we should see a trend where
T/f(N) gets smaller as N grows.
if f(N) is too small, then we should see a trend where T/f(N)
gets larger as N grows.
These three observations allow us to “bracket” a function that
we think represents the actual complexity.
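The reasoning behind these three observations can be written out briefly. It rests on a simplifying assumption that is not stated above: for large N the measured time behaves like T(N) ≈ c·g(N) for some constant c > 0, where g is the true growth function.

```latex
\frac{T(N)}{f(N)} \approx c \, \frac{g(N)}{f(N)}
\quad\Longrightarrow\quad
\lim_{N \to \infty} \frac{T(N)}{f(N)} =
\begin{cases}
c      & \text{if } f = g \text{ (ratio roughly constant)}, \\
0      & \text{if } f \text{ grows faster than } g \text{ (} f \text{ too large)}, \\
\infty & \text{if } f \text{ grows slower than } g \text{ (} f \text{ too small)}.
\end{cases}
```

Real measurements are noisy, so in practice you look for a clear trend across widely spaced values of N, not for an exact limit.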
Start with the function f(N) that you predicted from your
analysis. Add two columns to your spreadsheet. The first should
compute the f(N) value, and the second should compute T/f(N).
For example, if you predicted that the function was O(N^3)
on average, then you might have
N | R | TT | T | N^3 | T/N^3
100 | 10000 | 60.52 | .006052 | 1000000 | 6.052E-09
200 | 5000 | 56.5 | .0113 | 8000000 | 1.413E-09
400 | 2000 | 110.2 | .0551 | 64000000 | 8.609E-10
In both the T column and various f(N) and T/f(N) columns, use
the spreadsheet to compute values. Don’t compute them with a
calculator or other program on the side and then manually enter
them into the spreadsheet columns. Calculation is what
spreadsheets are for!
(Besides, it’s much easier for me to check your formulas than to
have to recompute your results whenever I see something that
looks a bit fishy.)
Look at the results you have. Is the T/f(N) column nearly
constant? If not, what is the trend? If your f(N) is too large,
repeat this procedure with smaller functions (e.g. if N^3 is too
large, try N^2). If your f(N) is too small, repeat this procedure
with larger functions (e.g. if N^3 is too small, try N^4). You
can also try larger functions by multiplying by log N,
e.g., if you need something larger than O(N), try
O(N log N).
Each function that you try should be shown as an additional two
columns in the spreadsheet.
Once you think you have found the true complexity f(N), make
sure that you have “bracketed it” by adding a function slightly
smaller than f(N) and a function slightly larger than f(N). (You
may already have one or both in your spreadsheet from
step 2. If so, you don’t need to add them again.)
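Bracketing can be pictured as running the same trend check on three candidate functions. The sketch below uses a deliberately crude test that is an assumption of this example, not part of the assignment: it compares T/f(N) at the largest N with T/f(N) at the smallest N, and calls the ratio roughly constant if the two differ by less than a chosen factor (tolerance). All names are hypothetical, and your own judgment of the spreadsheet columns remains the real test.

```cpp
#include <stdexcept>
#include <vector>

enum class Trend { RoughlyConstant, Shrinking, Growing };

// Crude trend check: compare T/f(N) for the last row with the first row.
// tolerance is an assumed threshold factor and must be greater than 1.
Trend trendOf(const std::vector<double>& ns, const std::vector<double>& ts,
              double (*f)(double), double tolerance) {
    if (ns.size() != ts.size() || ns.size() < 2 || tolerance <= 1.0) {
        throw std::invalid_argument("need matching columns, 2+ rows, tolerance > 1");
    }
    const double first = ts.front() / f(ns.front());
    const double last = ts.back() / f(ns.back());
    const double change = last / first;
    if (change > tolerance) return Trend::Growing;          // f(N) too small
    if (change < 1.0 / tolerance) return Trend::Shrinking;  // f(N) too large
    return Trend::RoughlyConstant;                          // f(N) plausible
}

// Three candidates that bracket a quadratic guess.
double linear(double n) { return n; }
double quadratic(double n) { return n * n; }
double cubic(double n) { return n * n * n; }
```

With synthetic data that grows exactly as N^2, this check reports Growing for linear, RoughlyConstant for quadratic, and Shrinking for cubic, which is the pattern a successful bracketing should show.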
4 Report
Prepare a report on your analysis and experimental results. This
report may be in a plain-text .txt file or in a PDF .pdf file.
(Most word processors will allow you to save or export to PDF.)
The report should include
Your copy-and-paste analysis
Your spreadsheet with your tabulated timing data
To the document containing your analysis, add a brief discussion
of how your experimental results match up with your prediction.