This document describes a C programming assignment to simulate and visualize the Central Limit Theorem. The program will take random samples, calculate averages, store the averages in an array, and create a histogram of the averages to display the normal distribution as a bell curve. It provides details on functions to calculate means and variances, populate an array with averages from multiple runs, create the histogram by binning values and counting frequencies, and print the histogram to the screen. Constant values are suggested for parameters like the number of samples, runs, bins, and scaling factors.
This programming assignment will exercise your C programming muscles w.docx
1. This programming assignment will exercise your C programming muscles while also enabling
you to visualize an essential fact about statistics: the Central Limit Theorem _. According to this
theorem, if you take the average over many random samples, and do this many times, those
averages will conform to a pattern known as the normal distribution .. If you then plot the
averages, with their values along the X axis and the frequency with which each range of values
occurs along the Y axis, the plot will take on the familiar "bell curve" shape. We can produce a
similar shape by means of a histogram, like a horizontal bar chart, appearing as rows of Xs ,
where the number of Xs in each row represents the number of times that an average over random
samples fell into the range represented by that row. Each row in this hisogram is labeled with the
central value of the range of averages represented by that row. The number of Xs corresponds to
the frequency with which an average falls into that interval. Below the histogram, the program
prints the overal mean over all the averages as well as the sample variance, or mean over the
squared differences between an actual average and the mean over the averages. Details We start
by computing the mean over many random samples: get_mean_of_uniform_random_samples.
Define a symbolic constant, SAMPLES, for the number of (pseudo-)random samples to be taken
to make up this mean. Then call the C Standard Library function rand , declared in stdlib.h, that
number of times. Accumulate the sum of those values - as a double - and then, after all iterations
are complete, divide the sum by SAMPLES. Now, rand returns an int value in the range [0,
RAND_MAX], where RAND_MAX is an implementation-dependent but big integer value. Such
an integer value is not exactly what we need for our purposes. Instead, we need a rational
number value between 1 and 1 and centered around 0 , so that wen can aim for a mean of 0 . To
get such a value, we transform the results of rand with some straightforward arithmetic: - Divide
the result by RAND_MAX. using rational number division (In order to do this, cast at least one
operand to type double and store the result in a variable of type double.) This gives a rational
number value in [ 0.0 , 1.0 ] . - Multiply by 2.0. This gives a rational number value in [ 0.0 , 2.0 ]
. - Subtract 1.0. This gives the desired value in [ 1.0 , 1.0 ] . Be sure to cast both dividend and
divisor to type double before the first division to ensure rational number division rather than
integer division, and use variables and constants of type double from that point onward. Recall
that the compiler reads a numeric literal with a decimal point in it as having type double. Create
a main function so that you can call get_mean_of_uniform_random_samples and print the result
to test whether it looks reasonable. You might want to call the function in a loop and print the
values so that you can see whether they tend toward 0 , which should be the mean over the
means. Next, we need a function to call get_mean_of_uniform_random_samples repeatedly for
RUNS times (where we define RUNS as another symbolic constant) and stores the return values
in an array, values, passed in as an outparameter. Since we are already iterating over RUNS, this
function should also accumulate a sum over all those averages, and then, after populating the
values array (containing the averages), divide the sum by RUNS (using rational number division
- which will happen automatically so long as sum has type double) and return it as the mean over
the sample averages. So this function is called populate_values_and_get_mean. It reutrns the
mean over the averages as a double. To test your program thus far, allocate an array of RUNS
elements, each of type double, in main and call populate_values_and_get_mean. Use malloc _ or
calloc B _to allocate the required memory space on the heap. (Note that malloc takes one
parameter, the number of bytes to be allocated. DO NOT USE A NUMERIC LITERAL! Use
arithmetic in the argument to compute the number of bytes. calloc takes two arguments, the
number of elements in the array and the number of bytes in each element. Again, use
computational methods to arrive at the latter value. calloc has the advantage of zero-initializing
2. the array, but, in this particular case, you do not need to have that, since you never read the
elements of values until after you have set the values in populate_values_and_get_mean.) Then
call populate_values_and_get_mean, storing the returned mean and passing in values as an out-
parameter. Verify that the result looks right. A separate function, get_mean_squared_error, takes
the values array and the mean calculated by a call to populate_values_and_get_mean - as in-
parameters - and returns the mean squared error over the values. Each error is the difference
between the actual value in values and the mean passed in. Iterating through values, square the
error and add that result to the running sum. Then return the sum divided by RUNS. Again, make
sure that you do rational number arithmetic. Again, test this function from main. Another
function, create_histogram, takes the values array as an in-parameter and an array of int, counts,
as an outparameter, and populates the counts with counts of the values occurring within the
bounds to be defined for the respective ranges ("bins" (or "buckets") associated with the
elements of counts. So counts[0] holds the count of values (averages) occurring within the lowest
range, counts[1] the count of values occurring within the second-lowest range, etc. There are
BINS elements in counts, i.e., BINS "bins" into which we imagine tossing the values, where
counts holds the numbers of values contained in those respective "bins." The general idea is to
iterate over the values. For each value, we iterate over the bins, finding the lower and upper
bounds for the current bin. If the value occurs within those bounds, we increment that count and
then break out of the inner loop, since we know that the value will not belong in any other bin.
There is a slight additional wrinkle to creating the histogram: as it happens, the more random
samples you take on each run in RUNS, the closer the average for that run will approach the
overall mean (i.e., 0), so values get bunched toward the center, which tends to obscure the "bell
curve" appearance of the graph. To compensate, we shrink the range of bins of interest down
from the original interval [ 1 , 1 ] to a smaller interval. The size of this smaller interval is
BIN_SPAN. So the interval of interest for us is [-BIN_SPAN/2, BIN_SPAN/2], and the size of
each bin, bin_size, is BIN_SPAN / (double)BINS. Putting these two concepts together, we can
write the inner loop of create_histogram along these lines: double bin_start; for ( j = 0 , bin_start
= -(BIN_SPAN /2.0 ) ; j < BINS; + + j , bin_start + = bin_size ) { // ... } On each iteration, of
course, bin_end is bin_start + bin_size. (NOTE: for obscure reasons, you have to declare both j
and bin_start outside of the loop - otherwise, you will get incorrect behavior.) Finally,
print_histogram takes the counts array as an in-parameter and prints the rows of Xs , one row per
element of count. It also must print out the labels. In order to print the labels, it follows a logic
similar to that of the inner loop in create_histogram, because the label, on each iteration, is
bin_start + bin_size / 2. Here, too, there is one wrinkle: the row of Xs can get inconveniently
long and cause the bell to look distorted, or wrap-around lines can make a mess of it. For this
reason, we scale the number of X in count down by dividing it by SCALE.The number of Xs we
actually print has to be an integer, so we use plain integer division to scale down the number
from the raw counts[i] value. All the symbolic constants named here are subject to
experimentation to give a satisfactory result. Below are the values that I have found satisfactory
through trial and error: - SAMPLES - the number of (pseudo-)random samples on each run -
10,000. - RUNS - the number of times we take an average over SAMPLES random numbers -
50,000. - BINS - the number of subintervals into which to classify the averages and count their
frequencies - 64. - BIN_SPAN - the size of the smaller interval chosen for display in the
histogram 0.05 . - SCALE - factor by which to divide each count when printing Xs onto the
console so as to fit the histogram neatly onto the screen 32