Summary: In this lab, you will experiment with various probing strategies for hash tables, as well as hash functions for strings.

Preparation

Create a directory for this lab:
```
  mkdir somewhere/hashing
```
Go to your new directory:
```
  cd somewhere/hashing
```

Copy the files for this lab (and don't worry if cp complains that it can't copy my solution files ... that's the idea):

  cp ~weinman/public_html/courses/CSC207/2009S/labs/code/hashing/*.java .

  cp ~weinman/public_html/courses/CSC207/2009S/labs/code/hashing/hashing.ods .

  cp ~weinman/public_html/courses/CSC207/2009S/labs/code/hashing/words.txt .

Background

A hash function yields an index into an array where items in a collection are stored. If that index is occupied, then we must search for the item. A linear probing sequence simply steps sequentially through the array for the next unoccupied slot. A quadratic probing sequence takes larger and larger steps through the array to find an unoccupied slot. In class we discussed some theoretical properties of these strategies. In this lab you will do your own empirical studies of their effectiveness.

There are a few important classes you'll be working with:

HashSet

A modified version of the HashSet class given in the book. It features (or will feature):

Both linear and quadratic probing sequences
Specifiable capacity (number of items to store) and load factor (density of the array) at construction.
Counting/resetting the number of probes made during any operation

WordData

A class collecting String objects. Its main purpose is to randomly re-order (by using a RandomPermutation class) and evenly split a list of words for inserting and querying a HashSet. It implements Iterable<String>, which means you can use it easily in an enhanced for loop. When passed a file name, it automatically reads words from the file (assuming they are separated by newlines).

HashedString

A class that HAS-A String and does little else. Its main purpose is to override hashCode with your own hashing function for strings.

HashExperiment

A driver class that makes use of all of the above. It reads a list of words, splits it in half, creates a HashSet of HashedStrings and inserts half the words into it. Then it measures the average number of probes to query those words (successfully) and query the other words (unsuccesfully).

Exercises

Exercise 1

Compile all of the files from the command line using javac:
```
 javac *.java
```
In the rest of the lab, you will probably want to easily compile any classes you modify directly within Emacs by using C-c C-v C-c.
Open HashSet.java in Emacs and examine the three parameter constructor (starting at line 41) to be sure you understand what the parameters mean.
Open HashExperiment.java in Emacs and examine the construction of the HashSet (line 30). There are 24,193 words in dataSplit[0], how big must the array in the hash set be (at least)? Why might it not be exactly this size?
Run HashExperiment to get a feel for its operation:
```
 java HashExperiment words.txt
```
Do the values for the average number of probes seem good to you? Why or why not?

Exercise 2

While HashSet says it can do linear probing, the default version does not. Locate the method findPos in HashSet. This is the routine that actually determines where an object is or should be in the table.
Note that to implement quadratic probing, the offset amount (the change from the previous probe location) is iteratively increased by two (cf. Eq. 20.4, p. 726). What should the offset amount be (and how might it change after a probe) for a linear probing strategy?
Update findPos so that when the member variable useLinearProbe is true, the offset amount you described above is used. (Note: this really requires only one extraordinarily simple line of code.)
Change HashExperiment so that the HashSet it creates uses linear probing (line 30).
Compile HashSet. Re-run HashExperiment. You should see your results change, though not substantially.

Exercise 3

Using the equations in Theorem 20.3, calculate the expected number of probes for successful and unsuccessful searches using the linear probing strategy with the load factor of 0.2 currently used by the HashSet.
How well do these agree with the results from your second run of the experiment? (The one that actually used linear probing.)
Using the same equations, calculate the expected number of probes for a load factor of 0.8.
Change HashExperiment so that it constructs a HashSet a load-factor of 0.8 with linear probing.
Re-compile and run the experiment. Do the results agree with your predictions?
You may or may not recall a handy implementation detail of HashSet. Locate the add method. What happens when the number of occupied locations is greater than half the array length?
Change add so that the table is not expanded (rehashed) when items are added.
Re-compile HashSet and re-run your experiment. Now do the results agree with your predictions?
Leave add this way so that the load factor does not change as you proceed with the following experiments.

Exercise 4

Change HashExperiment so that it constructs a HashSet a load-factor of 0.8 with quadratic probing.
Re-run the experiment. You should notice a new message output.
When searching for words we added to the hash set in HashExperiment (lines 40-42), there is a conditional and error statement in case the word is not found. Why? (If you are unsure, consider the assumptions in Theorem 20.4, p. 724.)

Exercise 5

Open the spreadsheet hashing.ods.
Run experiments to fill in the missing data for the 6 load factors using both quadratic and linear probing. (Note: You may find it helpful to modify HashExperiment to incorporate a simple for loop, if you can correctly write that faster than it would take you to do it manually.)

Exercise 6

If you modified HashExperiment to do a for loop, you need not revert to a single experiment format (in fact, it may be more interesting!) for this exercise.
Open HashedString.java in Emacs and locate the hashCode method. This method (or one equivalent to it) is the one used by Java for String.hashCode(). It is similar to the one in the book, given in Figure 20.2 (p. 716).
How many bits are in Java's int type? How many in a char?
Recall that multiplying by 256=2⁸ is the same as shifting left by 8 bits. Let's say we change the 31 in the hashing function to 256. How many times can we shift left (multiply), using ints before bits start "falling off" the left end?
After that many multiplications, are there any bits left of the original 8 bits (from a character) in the integer?
What is the net effect for the hash code of words that end in the same four characters?
Would this be a very good hashing function?
Change HashedString so that it multiplies by 256 instead of 31.
Re-run HashExperiment with a HashSet having a load factor of 0.2 using quadratic probing (you may keep a loop over load factors if you have one; you'll just get additional experimental data).
How does the performance of this hash function compare to the original?
One reason for the poor performance is the dramatic aliasing between words noted in 6(f). Let's say we multiplied by slightly less, then we should be keeping at least a few of the bits from earlier characters around for longer, right?
Change HashedString so that it multiplies by 255 and re-run HashExperiment with the same parameters as before (e.g. in 6(i)).
How does the number of probes compare to the hash function using the 256 multiplier? How do they compare to the original function using 31?
If making the number small keeps around more bits from lower numbers, then we should be just multiplying by a smaller number, right? How should the results compare to your previous runs if we use a multiplier of 64 instead?
Change HashedString so that it multiplies by 64 and re-run HashExperiment. Are the results as good as you expected?
Why are the results what they are? (Hint: remember 64=2^6; what of the effect considered in 6.(e-f)?
How do you suppose the results might change if we used 63 instead? Try it! How do they compare to your original results with 31?

So the results of your experiments in this exercise are actually a little more nuanced than one might think. Remember that our goal is to have equitably distributed hash codes.

In this case, a short explanation for a desirable hash function is that we would like to multiply by something that is both small (to keep information from early characters around) AND relatively prime to the storage size of the hash code. Why the latter condition?

The properties of our language, combined with the ASCII character set is that, some bits (of the 8) in characters are somewhat "predictable." Others are harder to guess. If we wish our hash codes to be well-distributed throughout the table, we'd like this unpredictability to be used in all 32 bits of the hash code.

To accomplish this, we need the 8 bits in a character to be "mixing" (adding) with different bits from other characters. That is, we want the 3rd bit to combine with the 5th bit sometimes, and the 2nd bit to combine with the 8th bit, too. Really, we'd like as many of the 8 bit slots for characters to get added to as many of the other bit slots as possible, throughout the loop.

Thus, in order for our bits to mix well, we need to multiply by "weird" (relatively prime) numbers.

Created: Jerod Weinman, 20 March 2009