Laboratory: Hash Tables
 CSC 207 Algorithms and Object-Oriented Design, Spring 2011

Summary: In this lab, you will experiment with various probing strategies for hash tables, as well as hash functions for strings.

# Preparation

1. Create a directory for this lab:
`  mkdir somewhere/hashing`
2. Go to your new directory:
`  cd somewhere/hashing`
3. Copy the files for this lab (and don't worry if `cp` complains that it can't copy my solution files ... that's the idea):
`  cp ~weinman/public_html/courses/CSC207/2011S/labs/code/hashing/*.java .`
`  cp ~weinman/public_html/courses/CSC207/2011S/labs/code/hashing/hashing.ods .`
`  cp ~weinman/public_html/courses/CSC207/2011S/labs/code/hashing/words.txt .`

# Background

A hash function yields an index into an array where items in a collection are stored. If that index is occupied, then we must search for the item. A linear probing sequence simply steps sequentially through the array for the next unoccupied slot. A quadratic probing sequence takes larger and larger steps through the array to find an unoccupied slot. In class we discussed some theoretical properties of these strategies. In this lab you will do your own empirical studies of their effectiveness.
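The two strategies can be sketched by listing the first few probe indices each one visits. This standalone snippet is only illustrative (it is not the lab's `HashSet` code):

```java
// Illustrative sketch (not the lab's HashSet): the first few probe
// indices visited by each strategy for a hash value h in a table of
// size M. Linear probing visits h, h+1, h+2, ...; quadratic probing
// visits h, h+1, h+4, h+9, ... (all taken mod M).
public class ProbeDemo {
    static int[] probes(int h, int M, boolean linear, int count) {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            int offset = linear ? i : i * i;
            out[i] = (h + offset) % M;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(probes(3, 11, true, 5)));
        // linear:    [3, 4, 5, 6, 7]
        System.out.println(java.util.Arrays.toString(probes(3, 11, false, 5)));
        // quadratic: [3, 4, 7, 1, 8]
    }
}
```

Note how the quadratic sequence wraps around and jumps to distant slots, which is exactly what helps it avoid the "clusters" that linear probing builds up.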

There are a few important classes you'll be working with:

`HashSet`
A modified version of the `HashSet` class given in the book. It features (or will feature):
• Both linear and quadratic probing sequences
• Specifiable capacity (number of items to store) and load factor (density of the array) at construction.
• Counting/resetting the number of probes made during any operation
`WordData`
A class collecting `String` objects. Its main purpose is to randomly re-order (by using a `RandomPermutation` class) and evenly split a list of words for inserting and querying a `HashSet`. It implements `Iterable<String>`, which means you can use it easily in an enhanced `for` loop. When the constructor is passed a file name, it automatically reads words from the file (assuming they are separated by newlines).
`HashedString`
A class that HAS-A `String` and does little else. Its main purpose is to override `hashCode` with your own hashing function for strings.
`HashExperiment`
A driver class that makes use of all of the above. It reads a list of words, splits it in half, creates a `HashSet` of `HashedString`s, and inserts half the words into it. Then it measures the average number of probes to query those words (successfully) and to query the other words (unsuccessfully).
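Because `WordData` implements `Iterable<String>`, you can hand it directly to an enhanced `for` loop. Here is a minimal sketch of that pattern; the class and field names are illustrative, not the lab's actual source:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Minimal sketch of the Iterable<String> pattern that WordData uses.
// The names here are hypothetical; see WordData.java for the real code.
public class WordListSketch implements Iterable<String> {
    private final List<String> words; // hypothetical backing store

    public WordListSketch(String... words) {
        this.words = Arrays.asList(words);
    }

    // Implementing Iterable<String> is what enables the enhanced for loop
    public Iterator<String> iterator() {
        return words.iterator();
    }

    public static void main(String[] args) {
        WordListSketch data = new WordListSketch("apple", "banana");
        for (String w : data)   // works because the class is Iterable
            System.out.println(w);
    }
}
```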

# Exercises

## Exercise 1

1. Compile all of the files from the command line using javac:
` javac *.java`
For the rest of the lab, you will probably want to compile any classes you modify directly within Emacs using C-c C-v C-c.
2. Open `HashSet.java` in Emacs and examine the three parameter constructor (starting at line 45) to be sure you understand what the parameters mean.
3. Open `HashExperiment.java` in Emacs and examine the construction of the `HashSet` (line 41). There are 24,193 words in `dataSplit[0]`; how big must the array in the hash set be (at least)? Why might it not be exactly this size?
4. Run `HashExperiment` to get a feel for its operation:
` java HashExperiment words.txt`
5. Do the values for the average number of probes seem good to you? Why or why not?

## Exercise 2

1. While `HashSet` says it can do linear probing, the default version does not. Locate the method `findPos` in `HashSet`. This is the routine that actually determines where an object is or should be in the table.
2. Note that to implement quadratic probing, the offset amount (the change from the previous probe location) is iteratively increased by two (cf. Eq. 20.4, p. 726). What should the offset amount be (and how might it change after a probe) for a linear probing strategy?
3. Update `findPos` so that when the member variable `useLinearProbe` is `true`, the offset amount you described above is used. (Note: this really requires only one extraordinarily simple line of code.)
4. Change `HashExperiment` so that the `HashSet` it creates uses linear probing (line 41).
5. Compile `HashSet`. Re-run `HashExperiment`. You should see your results change, though not substantially.
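To see the shape of the change without spoiling the exact line, here is a self-contained probe loop modeled on the book's quadratic-probing `findPos` (where the offset grows by two each probe). The names and structure are illustrative, not the lab's exact source:

```java
// Self-contained sketch of a findPos-style probe loop, modeled on the
// book's quadratic probing (offset grows by 2 each probe; cf. Eq. 20.4).
// The useLinearProbe branch shows the style of one-line change that
// Exercise 2 asks for. Names are illustrative, not the lab's source.
public class FindPosSketch {
    static Object[] table = new Object[11];
    static boolean useLinearProbe = false;

    static int findPos(Object x) {
        int offset = 1;
        int pos = Math.abs(x.hashCode() % table.length);
        while (table[pos] != null && !table[pos].equals(x)) {
            if (useLinearProbe)
                pos += 1;          // linear: a constant step of one
            else {
                pos += offset;     // quadratic: step grows...
                offset += 2;       // ...by two after every probe
            }
            while (pos >= table.length)
                pos -= table.length;   // wrap around the array
        }
        return pos;
    }

    public static void main(String[] args) {
        int home = Math.abs("cat".hashCode() % table.length);
        table[home] = "blocker";   // force a collision at cat's home slot
        // The first probe step lands one slot past home (mod table length)
        System.out.println(home + " -> " + findPos("cat"));
    }
}
```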

## Exercise 3

1. Using the equations in Theorem 20.3, calculate the expected number of probes for successful and unsuccessful searches using the linear probing strategy with the load factor of 0.2 currently used by the `HashSet`.
2. How well do these agree with the results from your second run of the experiment? (The one that actually used linear probing.)
3. Using the same equations, calculate the expected number of probes for a load factor of 0.8.
4. Change `HashExperiment` so that it constructs a `HashSet` with a load factor of 0.8 and linear probing.
5. Re-compile and run the experiment. Do the results agree with your predictions?
6. You may or may not recall a handy implementation detail of `HashSet`. Locate the `add` method. What happens when the number of occupied locations is greater than half the array length?
7. Change `add` so that the table is not expanded (rehashed) when items are added.
8. Re-compile `HashSet` and re-run your experiment. Now do the results agree with your predictions?
9. Leave `add` this way so that the load factor does not change as you proceed with the following experiments.
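If I recall Theorem 20.3 correctly, the linear-probing estimates are (1 + 1/(1-λ))/2 probes for a successful search and (1 + 1/(1-λ)²)/2 for an unsuccessful one. This snippet just evaluates those expressions so you can compare against your runs; double-check the formulas against the text before relying on them:

```java
// Evaluates the (believed) Theorem 20.3 estimates for linear probing:
//   successful   ~ (1 + 1/(1-λ))   / 2
//   unsuccessful ~ (1 + 1/(1-λ)^2) / 2
// Verify these against the text before relying on them.
public class ExpectedProbes {
    static double successful(double lambda) {
        return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
    }

    static double unsuccessful(double lambda) {
        return 0.5 * (1.0 + 1.0 / ((1.0 - lambda) * (1.0 - lambda)));
    }

    public static void main(String[] args) {
        double lambda = 0.5;  // plug in the lab's load factors here
        System.out.println("successful:   " + successful(lambda));   // 1.5
        System.out.println("unsuccessful: " + unsuccessful(lambda)); // 2.5
    }
}
```

Notice how quickly the unsuccessful-search cost blows up as λ approaches 1; that is the behavior your 0.8 experiment should expose.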

## Exercise 4

1. Change `HashExperiment` so that it constructs a `HashSet` with a load factor of 0.8 and quadratic probing.
2. Re-run the experiment. You should notice a new message output.
3. When searching for words we added to the hash set in `HashExperiment` (lines 52-54), there is a conditional and error statement in case the word is not found. Why? (If you are unsure, consider the assumptions in Theorem 20.4, p. 724.)
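The failure has a concrete cause: for a prime table size M, the quadratic probe sequence h, h+1, h+4, h+9, ... visits only about half of the slots before repeating. This standalone snippet counts the distinct slots visited:

```java
import java.util.HashSet;
import java.util.Set;

// Demonstrates why quadratic probing needs load factor <= 0.5: for a
// prime table size M, the probes (h + i*i) mod M visit only about M/2
// distinct slots before cycling, so a more-than-half-full table can
// defeat the search even though empty slots exist elsewhere.
public class QuadraticCycle {
    static Set<Integer> distinctProbes(int h, int M) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < M; i++)
            seen.add((h + i * i) % M);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(distinctProbes(0, 7));   // only 4 of 7 slots
        System.out.println(distinctProbes(0, 11));  // only 6 of 11 slots
    }
}
```

If every one of those reachable slots happens to be occupied, `findPos` can loop forever; that is why the driver checks for failure.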

## Exercise 5

1. Open the spreadsheet hashing.ods:
`  oocalc hashing.ods`
2. Run experiments to fill in the missing data for the 6 load factors using both quadratic and linear probing. (Note: You may find it helpful to modify `HashExperiment` to incorporate a simple for loop, if you can correctly write that faster than it would take you to do it manually.)

## Exercise 6

1. If you modified `HashExperiment` to loop over several experiments, you need not revert to the single-experiment format for this exercise (in fact, keeping the loop may make the results more interesting!).
2. Open `HashedString.java` in Emacs and locate the `hashCode` method. This method (or one equivalent to it) is the one used by Java for `String.hashCode()`. It is similar to the one in the book, given in Figure 20.2 (p. 716).
3. How many bits are in Java's `int` type? How many in a `char`?
4. Recall that multiplying by 256 = 2^8 is the same as shifting left by 8 bits. Let's say we change the 31 in the hashing function to 256. How many times can we shift left (multiply), using `int`s, before bits start "falling off" the left end?
5. After that many multiplications, are there any bits left of the original 8 bits (from a character) in the integer?
6. What is the net effect for the hash code of words that end in the same four characters?
7. Would this be a very good hashing function?
8. Change `HashedString` so that it multiplies by 256 instead of 31.
9. Re-run `HashExperiment` with a `HashSet` having a load factor of 0.2 using quadratic probing (you may keep a loop over load factors if you have one; you'll just get additional experimental data).
10. How does the performance of this hash function compare to the original?
11. One reason for the poor performance is the dramatic aliasing between words noted in step 6. If we multiply by slightly less, we should keep at least a few of the bits from earlier characters around for longer, right?
12. Change `HashedString` so that it multiplies by 255 and re-run `HashExperiment` with the same parameters as before (i.e., as in step 9).
13. How does the number of probes compare to the hash function using the 256 multiplier? How do they compare to the original function using 31?
14. If a smaller multiplier keeps more bits from earlier characters around, then we should just multiply by an even smaller number, right? How should the results compare to your previous runs if we use a multiplier of 64 instead?
15. Change `HashedString` so that it multiplies by 64 and re-run `HashExperiment`. Are the results as good as you expected?
16. Why are the results what they are? (Hint: remember 64 = 2^6; what of the effect considered in steps 5 and 6?)
17. How do you suppose the results might change if we used 63 instead? Try it! How do they compare to your original results with 31?
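The aliasing from step 6 is easy to see in isolation. With a multiplier of 256 = 2^8, any character more than four places from the end of the word is shifted at least 32 bits and disappears from the `int` entirely, so words sharing their last four characters collide. This `polyHash` mimics the book's loop; it is not the lab's `HashedString` source:

```java
// Shows the aliasing from step 6: with multiplier 256 = 2^8, characters
// more than four places from the end are shifted >= 32 bits and fall out
// of the 32-bit int, so words sharing their last four characters collide.
// polyHash mimics the book's hashing loop (not the lab's source).
public class MultiplierDemo {
    static int polyHash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = h * multiplier + s.charAt(i);  // overflow wraps mod 2^32
        return h;
    }

    public static void main(String[] args) {
        // "nation" and "station" share the four-character suffix "tion"
        System.out.println(polyHash("nation", 256) == polyHash("station", 256)); // true
        System.out.println(polyHash("nation", 31) == polyHash("station", 31));   // false
    }
}
```

With thousands of English words sharing common suffixes like "tion" and "ings", this kind of collision is catastrophic for the probe counts.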

# Epilogue

So the results of your experiments in this lab are actually a little more nuanced than one might think. Remember that our goal is to have equitably distributed hash codes.

In this case, a short explanation of a desirable hash function is that we would like to multiply by something that is both small (to keep information from early characters around) AND relatively prime to the size of the hash code's range (2^32 for a 32-bit `int`). Why the latter condition?

A property of our language, combined with the ASCII character set, is that some bits (of the 8) in characters are somewhat "predictable." Others are harder to guess. If we wish our hash codes to be well distributed throughout the table, we'd like this unpredictability to be used in all 32 bits of the hash code.

To accomplish this, we need the 8 bits in a character to be "mixing" (adding) with different bits from other characters. That is, we want the 3rd bit to combine with the 5th bit sometimes, and the 2nd bit to combine with the 8th bit, too. Really, we'd like as many of the 8 bit slots for characters to get added to as many of the other bit slots as possible, throughout the loop.

Thus, in order for our bits to mix well, we need to multiply by "weird" (relatively prime) numbers.
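The effect of relative primality can be made concrete. With multiplier 64 = 2^6, characters more than five places from the end are shifted at least 36 bits and vanish, so in the sketch below 10,000 distinct words collide onto a single hash code; 63, being odd and hence relatively prime to 2^32, keeps every character contributing. The word pattern here is synthetic, chosen only to make the collapse visible:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the epilogue's point: with multiplier 64 = 2^6, characters
// more than five places from the end are shifted >= 36 bits and vanish,
// so these 10,000 distinct 10-character words all share one hash code.
// With 63 (odd, hence relatively prime to 2^32), all 10,000 differ.
public class MixingDemo {
    static int polyHash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = h * multiplier + s.charAt(i);  // overflow wraps mod 2^32
        return h;
    }

    static int distinctHashes(int multiplier) {
        Set<Integer> codes = new HashSet<>();
        for (int i = 0; i < 10000; i++)  // e.g. "0042ending"
            codes.add(polyHash(String.format("%04d", i) + "ending", multiplier));
        return codes.size();
    }

    public static void main(String[] args) {
        System.out.println("m=64: " + distinctHashes(64));  // 1
        System.out.println("m=63: " + distinctHashes(63));  // all 10,000 distinct
    }
}
```

A power-of-two multiplier only ever shifts bits, so old information slides off the end; a "weird" multiplier both shifts and adds, folding old bits back into new positions.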

Created: Jerod Weinman, 20 March 2009