Pointers, Functions, and Arrays

Summary

You will write a program to graph the distribution of squared values using functions, pointers, and arrays.

Objectives

Learn to apply good functional decomposition
Practice working with arrays and pointers
Apply development principles in support of testing
Have fun choosing your own adventure

Background

A histogram is a useful statistical tool that helps us visualize how often a value from a sample falls into a certain category. We also sometimes like to know what percentage of values in a sample is at or below a particular threshold, which is another graph called the cumulative distribution.

In this assignment you'll be writing a program that randomly generates a sample, or set of "unpredictable" values, and discretizes these random values into bins or buckets so we can visually "graph" how many values fall into each bucket. To emphasize the ease of code reuse when practicing proper functional decomposition (and do some more array processing), you'll also create and visualize a transformation of your original data view.

You won't be surprised to find out there is a mechanism for generating unpredictable (aka random) numbers between 0 and 1 in most programming languages, including C. One might ask whether the numbers produced are reasonably well-distributed over the intended range. If they are, we would expect to see a roughly equal number of values in each of several equally sized bins.

For example, if we divided the values between 0.0 and 1.0 (inclusive) into ten bins corresponding to the ranges [0,0.1), [0.1,0.2), ..., [0.9,1.0], where a square bracket indicates the value it bounds is included in the bin but a parenthesis indicates it is excluded (so that every value has a unique destination bin), we can count how many times a random value falls into each bin and display the final counts with a screen marker.

Of course, if we placed a mark for each value in the sample, we would not get a very good visualization because each "bin" would have a bar that far exceeded our screen's dimensions. Instead, we figure out the maximum size of our graph and scale the data so that the fullest bin occupies the most space, and the rest of the bin counts are scaled relative to the fullest bin.

Taking this approach, such a histogram of uniform random numbers with a sample size of just 100 and ten bins looks like the following.

***************************************************
*********************************************************************
**********************************************
********************************************************************************
***************************************************************
**************************************************************************
*****************************
***************************************************
**********************************************
***************************************************************

This graph does not look very uniform; that is, the counts vary widely. This variance is probably due to our very small sample size. What happens if we increase the number of samples one-hundred-fold? While we're at it, let's up the granularity of our measurements by halving our bin size. The result looks like this.

*******************************************************************************
********************************************************************************
*************************************************************************
****************************************************************************
*****************************************************************************
***************************************************************************
**************************************************************************
*****************************************************************************
****************************************************************************
*******************************************************************************
**************************************************************************
*************************************************************************
**************************************************************************
*******************************************************************************
***********************************************************************
*******************************************************************************
***********************************************************************
****************************************************************************
************************************************************************
***************************************************************************

You're not going to get gold stars in your lab notebook for that graph, but it looks much closer to what we expect. Simulation for the win!

Another interesting quantity is the so-called cumulative distribution. What's that? Well, the first thing you graph would be the the count in the first bin. The second thing you graph would be the sum of the first and second bins. The third thing you graph is the sum of the first, second, and third, bins. The fourth ... well, you get the picture. Because each of our bins should be roughly the same size, we should expect a nice linear relationship in our cumulative distribution graph. Let's visualize the cumulative of the sample above, also with 20 bins.

****
********
************
****************
********************
************************
****************************
********************************
************************************
*****************************************
********************************************
************************************************
****************************************************
********************************************************
************************************************************
****************************************************************
********************************************************************
************************************************************************
****************************************************************************
********************************************************************************

Looking good! Now, what do we need to do? Listing the tasks in the order they might come to mind, we need to

Draw graphs. Two of 'em. Better make a function for that.
Generate samples. How do we get random values? Better look that up.
Discretize values. That is, given a number in [0,1], how do we figure out which bin it belongs in?
Make a cumulative sum. Sounds like another good opportunity for a function to abstract this specific task.
Test all these pieces to make sure they work!

We'll probably think of some other stuff along the way, but often writing out comments and working from the top down to the details as we stub out functions helps us figure that out. Now, let's go to work.

Assignment

Place all of your code in a single file called graph.c. However, rather than graph the uninteresting uniform distribution shown above, you will graph the distribution when the uniformly distributed value is squared. (That is, get a random number in [0,1] and square it; look at the histogram and cumulative distribution of those squared values.) Any idea what that graph will look like? Ponder it for a moment, then let's get started.

Getting Started

Begin by writing out comments within main for the major top-level steps indicated above:
- run a simulation that generates counts of samples
- calculate the cumulative distribution of those counts
- draw two graphs
We don't want any magic numbers in your program (values that appear as constant, devoid of context or explanation). Thus, in main define constant variables (a misnomer if there ever was one) in main using the const keyword for things like the sample size and maximum bar width (or the number of characters on a line). For example:
```
const unsigned int NUM_PLANETS = 8; /* Don't get Pluto started! */
```
Likewise, the number of bins is going to control your array variable length, so anywhere you declare an array of that length you will want to use a preprocessor macro with a #define directive. (Recall this helps the compiler know exactly how much space to reserve for your array on the stack, whereas even a constant variable won't allow the compiler to know the size.)
What other data will you need to store in your progam? Looks like you'll need two arrays: one to hold the counts and one to hold the cumulative sums. Declare those in main now with clear, sensible names and use the right types (Will they hold decimal values? Negative values?)
Good. Now give the top-level tasks some sensible function names, determine their parameters (you'll probably need to use some of the consts you just declared), and write stubs for them. Don't forget to write a descriptive comment above each stub declaration
Getting there. Write some pre-conditions and post-conditions for these procedures to the best of your ability. Once you're done with that, use the assert procedure to verify the preconditions. If you realize you forgot some necessary parameters, you can add them now.
Finish up by adding calls to these procedures where they belong in main.

Time to sanity check. Compile and run your program skeleton so far and make sure things are on the up and up.

Diving Deeper

You've now got a basic outline, but we need to do some more stubbing (no toes, please) before we really get to work.

We discovered above we need a way to generate a random number in the range [0,1]. Let's keep our focus on organizing the solution first and worry about details later. Stub out a function (with comments) to get that random number. Comment it, write your pre/post conditions and use assertions to verify them.
We also need a way to take a number (say in the range [0,1] because squaring any such number keeps it inside that range) and figure out which bin it belongs to. (That is, which array index gets incremented.) Stub out a function to make that determination, and do that whole commenting and asserting thing here, too.
How will you scale your graph to fit on the screen? You need to find out how many items are in the fullest bucket. To do this, stub out a function that takes an array and returns the address of the largest element in the array. (Comment! Assert! Do NOT implement! Remember that hours of coding can save minutes of planning.)

Compile and run your newly enhanced skeleton so far and make sure things are running smoothly.

Nearly There ... Nearly

Now let's start to fill in some details of our functions.

So, about those random numbers. See page 686 in King for info on the rand function and RAND_MAX macro, which you can combine to get the double you'll need. Armed with information, complete the details of the function you stubbed in #7 above.
But wait. Did you do it right? Let's set up a regime for at least running through some tests. (We don't have a lovely unit testing harness like you hopefully used in Scheme, but we'll do our best.)
1. Stub a new procedure to hold tests for your sampling function in #10.
2. Stub a new procedure to hold all your tests (with comments!). Add a call to your procedure from #10a here.
3. Now, do the things we need to do to convince ourselves this code is correct: enumerate the circumstances, list the test cases and outcomes. (Randomness is tricky, so make sure you use srand to make your results repeatable.)
And now would be a great time to run your program and see if everything is aces.
This time, let's try things backwards to help us think through all the scenarios of our discretization function for #8.
1. Stub out a new procedure to hold the discretization test suite for the function in #8.
2. Add this function call to the function with your overall testing regime.
3. Now, come up with the tests that will convince you your discretization process is correct. That is, enumerate the circumstances and list some specific inputs and outputs to try. Be sure you test the special cases and boundaries!
Hopefully thinking about the tests helped you understand better how to write the function. Now, go ahead and fill in the details for making the calculations in your discretization function in #8. Then, hold your breath and run your tests to see if all your code remains up to snuff.
Repeat the test-driven development approach for the maximum-finding procedure you stubbed in #8. That is, write the tests first, then write the function.

The Final Details

The last pieces of your infrastructure are finally coming together. Soon we'll start to really see some results.

Write and test your graph-drawing function. Draw one horizontal bar for every entry in the array, and make sure to fill the screen (some pre-set maximum width) with the largest bar. For example, the array {1,2,3,4} should look like the following (if the width is 80).
```
********************
****************************************
************************************************************
********************************************************************************
```
Now test and implement (yes! in that order! write tests then implement) your cumulative distribution function.
All the pieces are almost there. Go ahead and fill in the details (using the functions you've written) your simulation function from all the way back in #4. Remember, this is the function that should populate the histogram array.

Did we forget anything? It seems you should pretty much be there. Try it out and see whether the shape of the distribution matches your expectations.

Make sure to add some titles to each graph so we know what we're looking at and they don't just run together.

Next time, it will be up to you to think about how to practice good program development techniques and incrementally solve the problem all on your own.

Grading

In addition to the general grading guidelines for evaluation, the assignment is worth 25 points.

[13 points] Main program

[1 point] Function produces random number in [0,1]
[2 points] Function correctly discretizes numbers
[2 points] Function correctly returns address of array maximum
[2 points] Function correctly calculates cumulative array sums
[2 points] Function correctly generates a sample histogram
[2 points] Function correctly draws a scaled graph from array
[2 points] Program graphs labeled histogram and cumulative

Evaluation of program Structure, Comments, Design, and Testing.
(Points not given, but points can be deducted.)
[12 points] Testing and Development
- [2 points] Functions give clear and complete pre- and post-conditions
- [2 points] Assertions verify pre-conditions
- [2 points] Test plan enumerates the ranges of problem circumstances
- [2 points] Tests compare cases to be considered with the expected outcome
- [2 points] Transcript records actual test runs
- [2 points] Summary statement explains why the program is correct
[≤5 points] Extra Credit
You may earn extra credit by enhancing your program in one significant way. Good possibilities include:
- horizontal/x-axis labels (i.e., evenly divide the range of the function being graphed and print labels along the bottom of the graph, centered at appropriate places)
- vertical/y-axis labels (print the bound of each line's bin to the left of the bar and be sure to adjust the bar width accordingly to avoid line-wrapping)
- use function pointers to run simulations and graphs for a variety of values

A five point penalty will apply to any submission that includes personally identifying information anywhere other than the references/honesty file.