| CSC 207 | Algorithms and Object Oriented Design | Spring 2009 |
Summary: In this lab, you will experiment with various probing strategies for hash tables, as well as hash functions for strings.
mkdir somewhere/hashing
cd somewhere/hashing
cp
complains that it can't copy my solution files ... that's the
idea):
cp ~weinman/public_html/courses/CSC207/2009S/labs/code/hashing/*.java .
cp ~weinman/public_html/courses/CSC207/2009S/labs/code/hashing/hashing.ods .
cp ~weinman/public_html/courses/CSC207/2009S/labs/code/hashing/words.txt .
A hash function yields an index into an array where items in a collection are stored. If that index is occupied, then we must search for the item. A linear probing sequence simply steps sequentially through the array for the next unoccupied slot. A quadratic probing sequence takes larger and larger steps through the array to find an unoccupied slot. In class we discussed some theoretical properties of these strategies. In this lab you will do your own empirical studies of their effectiveness.
There are a few important classes you'll be working with:
HashSetHashSet class given in the
book. It features (or will feature):
WordDataString objects. Its main purpose
is to randomly re-order (by using a RandomPermutation
class) and evenly split a list of words for inserting and querying
a HashSet. It
implements Iterable<String>, which means you
can use it easily in an enhanced for loop. When
passed a file name, it automatically reads words from the file
(assuming they are separated by newlines).
HashedStringString and does little
else. Its main purpose is to override hashCode with
your own hashing function for strings.
HashExperimentHashSet
of HashedStrings and inserts half the words into
it. Then it measures the average number of probes to query those
words (successfully) and query the other words (unsuccesfully).
javac *.javaIn the rest of the lab, you will probably want to easily compile any classes you modify directly within Emacs by using
HashSet.java in Emacs and examine the three
parameter constructor (starting at line 41) to be sure you
understand what the parameters mean.
HashExperiment.java in Emacs and examine the
construction of the HashSet (line 30). There are
24,193 words in dataSplit[0], how big must the array
in the hash set be (at least)? Why might it not be exactly this
size?
HashExperiment to get a feel for its operation:
java HashExperiment words.txt
HashSet says it can do linear probing, the
default version does not. Locate the method findPos
in HashSet. This is the routine that actually
determines where an object is or should be in the table.
findPos so that when the member
variable useLinearProbe is true, the
offset amount you described above is used. (Note: this really
requires only one extraordinarily simple line of code.)
HashExperiment so that the
HashSet it creates uses linear probing (line 30).
HashSet. Re-run HashExperiment. You
should see your results change, though not substantially.
HashSet.
HashExperiment so that it constructs
a HashSet a load-factor of 0.8 with linear probing.
HashSet. Locate the add method. What
happens when the number of occupied locations is greater than half
the array length?
add so that the table is not expanded
(rehashed) when items are added.
HashSet and re-run your experiment. Now do
the results agree with your predictions?
add this way so that the load factor does not
change as you proceed with the following experiments.
HashExperiment so that it constructs
a HashSet a load-factor of 0.8 with quadratic probing.
HashExperiment (lines 40-42), there is a
conditional and error statement in case the word is not
found. Why? (If you are unsure, consider the assumptions in
Theorem 20.4, p. 724.)
HashExperiment to incorporate a
simple for loop, if you can correctly write that faster than it
would take you to do it manually.)
HashExperiment to do a for loop, you
need not revert to a single experiment format (in fact, it may be
more interesting!) for this exercise.
HashedString.java in Emacs and locate
the hashCode method. This method (or one equivalent
to it) is the one used by Java
for String.hashCode(). It is similar to the one in
the book, given in Figure 20.2 (p. 716).
int type? How many in
a char?
ints before bits start "falling off" the left
end?
HashedString so that it multiplies by 256 instead of 31.
HashExperiment with a HashSet
having a load factor of 0.2 using quadratic probing (you may
keep a loop over load factors if you have one; you'll just get
additional experimental data).
HashedString so that it multiplies by 255 and
re-run HashExperiment with the same parameters as
before (e.g. in 6(i)).
HashedString so that it multiplies by 64 and
re-run HashExperiment. Are the results as good as you expected?
So the results of your experiments in this exercise are actually a little more nuanced than one might think. Remember that our goal is to have equitably distributed hash codes.
In this case, a short explanation for a desirable hash function is that we would like to multiply by something that is both small (to keep information from early characters around) AND relatively prime to the storage size of the hash code. Why the latter condition?
The properties of our language, combined with the ASCII character set is that, some bits (of the 8) in characters are somewhat "predictable." Others are harder to guess. If we wish our hash codes to be well-distributed throughout the table, we'd like this unpredictability to be used in all 32 bits of the hash code.
To accomplish this, we need the 8 bits in a character to be "mixing" (adding) with different bits from other characters. That is, we want the 3rd bit to combine with the 5th bit sometimes, and the 2nd bit to combine with the 8th bit, too. Really, we'd like as many of the 8 bit slots for characters to get added to as many of the other bit slots as possible, throughout the loop.
Thus, in order for our bits to mix well, we need to multiply by "weird" (relatively prime) numbers.