In the last in-class lab, you should have defined likelihood models
for typists one and two. Review these definitions in starter.rkt
and note in particular what you called them.
B. Language Modeling
Require the file hmm.rkt at the top of your code.
Using create-counts again, define
a count structure to store the language (character) transition model
for our Bayesian network.
Use count-transitions! (twice) to add the transitions from
the two original (uncorrupted) letters 4 and 16 to your structure
from B.2.
Using get-count and char->index, determine how many
times "t" was followed by "h" and how often "q" was followed
by "u". Note: read the postconditions for count-transitions!
very carefully to make sure you use the arguments in the right order!
How do these raw frequency counts compare?
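The idea behind the count structure is simple: a transition model is just a table of bigram counts keyed by (previous character, next character). Here is a minimal Python sketch of that idea; the names and sample strings below are illustrative only, not the actual create-counts/count-transitions! API from hmm.rkt:

```python
from collections import defaultdict

def count_transitions(counts, text):
    """Tally each adjacent character pair (bigram) in text."""
    for prev, nxt in zip(text, text[1:]):
        counts[(prev, nxt)] += 1

counts = defaultdict(int)                            # analogous to create-counts
count_transitions(counts, "the quick thin quilt")    # one "letter"
count_transitions(counts, "then the quiet queen")    # the other
print(counts[("t", "h")])  # how many times "t" was followed by "h"
print(counts[("q", "u")])  # how many times "q" was followed by "u"
```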
Use normalize-counts! again
to transform your transition counts into proper conditional probabilities.
Using get-count again, determine P(C=h | C′=t, L={4.txt, 16.txt}, I)
and P(C=u | C′=q, L={4.txt, 16.txt}, I).
How do the probabilities compare? Based on the frequencies you saw,
are these probabilities surprising? Explain the apparent discrepancy.
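Row normalization is the crux: each row of the count matrix is divided by its own total, so a rare predecessor can still produce a conditional probability near 1. A toy Python illustration (the counts and names here are made up, not taken from the letters):

```python
from collections import defaultdict

def normalize_rows(counts):
    """Divide each row of the bigram count table by its own total,
    yielding conditional probabilities P(next | prev)."""
    row_totals = defaultdict(int)
    for (prev, _), c in counts.items():
        row_totals[prev] += c
    return {(prev, nxt): c / row_totals[prev]
            for (prev, nxt), c in counts.items()}

# Made-up counts: "t" is frequent with many possible successors,
# while "q" is rare but always followed by "u".
counts = {("t", "h"): 50, ("t", "o"): 30, ("t", "e"): 20, ("q", "u"): 5}
probs = normalize_rows(counts)
print(probs[("t", "h")])  # 0.5: "th" is common, but so are other t-bigrams
print(probs[("q", "u")])  # 1.0: "qu" is rare in absolute terms, yet certain given "q"
```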
Use marginal-counts to define
a vector containing the total number of times each character appears
in the two letters. (But see the footnote below for a caveat.¹)
Make sure you insert the marginalization before the normalization
of the counts matrix from B.5!
Use normalize-marginal-counts!
to transform your marginal vector from B.7
into a proper marginal probability.
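In the same spirit, marginalizing and then normalizing amounts to summing the raw count matrix over one index and dividing by the grand total. A toy Python sketch (illustrative names and numbers only, not the marginal-counts API):

```python
from collections import defaultdict

def marginal_counts(counts):
    """Sum the raw bigram counts over the successor index, giving how
    many times each character appears as a predecessor."""
    marg = defaultdict(int)
    for (prev, _), c in counts.items():
        marg[prev] += c
    return marg

def normalize_marginal(marg):
    """Divide by the grand total so the vector sums to one."""
    total = sum(marg.values())
    return {ch: c / total for ch, c in marg.items()}

# Marginalize BEFORE normalizing the count matrix: once the rows have
# been turned into conditional probabilities, the raw frequencies are gone.
counts = {("t", "h"): 6, ("t", "o"): 2, ("q", "u"): 2}
p = normalize_marginal(marginal_counts(counts))
print(p["t"], p["q"])  # 0.8 0.2
```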
C. Marginal Sequential Inference
We already know letter 14 was transcribed by typist one. However,
we don't have the original, so we can't use log-likelihood
(from charmodel.rkt) as before to compare
P(O=typed.14.txt | C=orig.14.txt,D=16.txt,T=1)
and
P(O=typed.14.txt | C=orig.14.txt,D=4.txt,T=2).
However, we now have a language model to help us infer the original
values of C (and marginalize them) for comparison.
Use log-evidence and the values
you calculated in A, B.5, and B.8
to calculate the log marginal likelihood for the observed letter if
typed by typist one:
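What a log marginal likelihood computes is, in essence, the forward algorithm: summing the joint probability of the observation over every possible hidden sequence C, carried out in log space. A self-contained Python sketch with a made-up two-state model (none of these numbers or names come from the lab's models):

```python
import math

def log_marginal_likelihood(obs, states, log_init, log_trans, log_emit):
    """Forward algorithm in log space: log P(obs), with the hidden
    state sequence summed out."""
    alpha = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: math.log(sum(math.exp(alpha[r] + log_trans[r][s])
                                 for r in states)) + log_emit[s][o]
                 for s in states}
    return math.log(sum(math.exp(a) for a in alpha.values()))

# Toy two-state model; every number here is invented for illustration.
states = ["a", "b"]
li = {"a": math.log(0.5), "b": math.log(0.5)}
lt = {"a": {"a": math.log(0.9), "b": math.log(0.1)},
      "b": {"a": math.log(0.1), "b": math.log(0.9)}}
le = {"a": {"x": math.log(0.8), "y": math.log(0.2)},
      "b": {"x": math.log(0.3), "y": math.log(0.7)}}
print(log_marginal_likelihood("xxy", states, li, lt, le))
```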
Use count-errors-list to determine
how many errors there are in the restored version of the file you
just calculated. Dissatisfied? The homework may help resolve your
concerns.
For now, assuage them
by using most-likely-sequence to restore letter 8 as if typist
two had transcribed it:
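By contrast with the marginalization above, restoring a single best original presumably uses Viterbi decoding: keeping, for each state, only the single most probable path rather than summing over all of them. A minimal Python sketch with the same kind of made-up model (illustrative only, not the most-likely-sequence API):

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Return the single most probable hidden state sequence for obs."""
    # best[s] = (log prob of the best path ending in state s, that path)
    best = {s: (log_init[s] + log_emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {s: max(((best[r][0] + log_trans[r][s] + log_emit[s][o],
                         best[r][1] + [s]) for r in states),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])[1]

# Toy two-state model; every number here is invented for illustration.
states = ["a", "b"]
li = {"a": math.log(0.5), "b": math.log(0.5)}
lt = {"a": {"a": math.log(0.9), "b": math.log(0.1)},
      "b": {"a": math.log(0.1), "b": math.log(0.9)}}
le = {"a": {"x": math.log(0.8), "y": math.log(0.2)},
      "b": {"x": math.log(0.3), "y": math.log(0.7)}}
print(viterbi("xxy", states, li, lt, le))  # ['a', 'a', 'a']
```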
¹ If you initialized the structure with a 1 to use the rule of succession,
then each character's frequency will be overrepresented in your count,
making the marginal distribution a little more uniform than it might
otherwise be. As you'll see later, that "smoothing" may actually
be an advantage. For now, simply note that we've broken the rules
a bit, but we will tolerate that to make our programming just a bit
easier.
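The footnote's point can be seen numerically: starting every count at 1 (the rule of succession, i.e. add-one smoothing) pulls the normalized distribution toward uniform and keeps never-seen characters from getting probability zero. A small Python illustration with made-up counts:

```python
def smoothed_probs(counts, vocab, k=1):
    """Normalize counts after adding a pseudocount k to every entry
    (k=0 gives the raw relative frequencies)."""
    total = sum(counts.get(c, 0) + k for c in vocab)
    return {c: (counts.get(c, 0) + k) / total for c in vocab}

counts = {"e": 9, "q": 1, "z": 0}
vocab = ["e", "q", "z"]
print(smoothed_probs(counts, vocab, k=0))  # raw: "z" gets probability 0
print(smoothed_probs(counts, vocab, k=1))  # smoothed: flatter, "z" > 0
```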