0$. Since $\sum_{j=1}^{n}D^{-x_{j}}=1$,
our original problem will have a \emph{stationarising}
solution when
\[D^{-x_{j}}=p_{j},\ \text{that is to say}
\ x_{j}=-\frac{\log p_{j}}{\log D}\]
and
\[\sum_{j=1}^{n}p_{j}x_{j}=
-\sum_{j=1}^{n}p_{j}\frac{\log p_{j}}{\log D}.\]
\end{proof}
It is not hard to convince oneself that the stationarising
solution just found is, in fact, maximising, but it is an
unfortunate fact that IB variational calculus is suggestive
rather than conclusive.
The next two exercises
(which will be done in lectures and form part of the course)
provide a rigorous proof.
\begin{exercise}\label{E;Gibbs} (i) Show that
\[\log t\leq t-1\]
for $t>0$ with equality if and only if $t=1$.
(ii) {\bf [Gibbs' inequality]}
Suppose that $p_{j},q_{j}>0$ and
\[\sum_{j=1}^{n}p_{j}=\sum_{j=1}^{n}q_{j}=1.\]
By applying~(i) with $t=q_{j}/p_{j}$, show that
\[\sum_{j=1}^{n}p_{j}\log p_{j}\geq \sum_{j=1}^{n}p_{j}\log q_{j}\]
with equality if and only if $p_{j}=q_{j}$.
\end{exercise}
\begin{exercise} We use the notation of
Problem~\ref{P;Compress variation}.
(i) Show that, if $x_{j}^{*}=-\log p_{j}/\log D$, then
$x_{j}^{*}>0$ and
\[\sum_{j=1}^{n}D^{-x_{j}^{*}}=1.\]
(ii) Suppose that $y_{j}>0$ and
\[\sum_{j=1}^{n}D^{-y_{j}}=1.\]
Set $q_{j}=D^{-y_{j}}$. By using Gibbs' inequality
from Exercise~\ref{E;Gibbs}~(ii), show that
\[\sum_{j=1}^{n}p_{j}x_{j}^{*}\leq \sum_{j=1}^{n}p_{j}y_{j}\]
with equality if and only if $y_{j}=x_{j}^{*}$ for all $j$.
\end{exercise}
Analysts use logarithms to the base $e$, but the importance
of two-symbol alphabets means that communication theorists
often use logarithms to the base $2$.
\begin{exercise}\label{E;Memory}
(Memory jogger.) Let $a,\,b>0$. Show that
\[\log_{a} b=\frac{\log b}{\log a}.\]
\end{exercise}
The result of Problem~\ref{P;Compress variation}
is so important that it gives rise to a definition.
\begin{definition}\label{D;Shannon entropy}
Let ${\mathcal A}$ be a
non-empty finite set and $A$ a random variable taking
values in ${\mathcal A}$. If $A$ takes the value $a$
with probability $p_{a}$ we say that the system has
\emph{Shannon entropy}\footnote{It is unwise for the beginner
and may or may not be fruitless for the expert
to seek a link with entropy in physics.}
(or \emph{information entropy})
\[H(A)=-\sum_{a\in {\mathcal A}}p_{a}\log_{2} p_{a}.\]
\end{definition}
\begin{theorem}\label{T;no better}
Let ${\mathcal A}$ and ${\mathcal B}$
be finite alphabets and let ${\mathcal B}$ have $D$ symbols. If
$A$ is an ${\mathcal A}$-valued random variable,
then any decodable code $c:{\mathcal A}\rightarrow{\mathcal B}$
must satisfy
\[{\mathbb E}|c(A)|\geq \frac{H(A)}{\log_{2} D}.\]
\end{theorem}
Here $|c(A)|$ denotes the length of $c(A)$.
Notice that the result takes a particularly
simple form when $D=2$.
In Problem~\ref{P;Compress variation}
the $x_{j}$ are just positive real numbers but in
Problem~\ref{P;Compress three} the $d_{j}$ are integers.
Choosing $d_{j}$ as close as possible to the best $x_{j}$
may not give the best $d_{j}$, but it is certainly worth a try.
\begin{theorem}\label{T;Shannon--Fano}{\bf[Shannon--Fano encoding]}
Let ${\mathcal A}$ and ${\mathcal B}$
be finite alphabets and let ${\mathcal B}$ have $D$ symbols. If
$A$ is an ${\mathcal A}$-valued random variable,
then there exists
a prefix-free (so decodable) code
$c:{\mathcal A}\rightarrow{\mathcal B}$
which satisfies
\[{\mathbb E}|c(A)|\leq 1+\frac{H(A)}{\log_{2} D}.\]
\end{theorem}
\begin{proof} By Lemma~\ref{L;Kraft 2} (which states
that given lengths satisfying
Kraft's inequality we can construct an associated
prefix-free code), it suffices to find strictly
positive integers $l_{a}$ such that
\[\sum_{a\in {\mathcal A}}D^{-l_{a}}\leq 1,
\ \text{but}
\ \sum_{a\in {\mathcal A}}p_{a}l_{a}\leq 1+\frac{H(A)}{\log_{2} D}.\]
If we take
\[l_{a}=\lceil -\log_{D} p_{a}\rceil,\]
that is to say, we take $l_{a}$ to be the smallest integer
no smaller than $-\log_{D} p_{a}$, then these conditions
are satisfied and we are done.
\end{proof}
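The construction in the proof is easy to carry out mechanically. The following sketch (in Python; the function names and the sample distribution are our own, not part of the course) computes the lengths $l_{a}=\lceil -\log_{D}p_{a}\rceil$ and checks both conditions:

```python
from math import ceil, log

def shannon_fano_lengths(probs, D=2):
    """Code lengths l_a = ceil(-log_D p_a), as in the proof above."""
    return [ceil(-log(p, D)) for p in probs]

def kraft_sum(lengths, D=2):
    """Left-hand side of Kraft's inequality."""
    return sum(D ** -l for l in lengths)

def entropy(probs, D=2):
    """Entropy measured in base-D units."""
    return -sum(p * log(p, D) for p in probs)

probs = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_fano_lengths(probs)
expected_len = sum(p * l for p, l in zip(probs, lengths))

assert lengths == [2, 2, 3, 4]
assert kraft_sum(lengths) <= 1               # Kraft's inequality holds
assert expected_len <= 1 + entropy(probs)    # the Shannon--Fano bound
```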
It is very easy to use the method just indicated to
find an appropriate code. (Such codes are called Shannon--Fano
codes\footnote{Wikipedia and several other sources give
a definition of Shannon--Fano codes which is
definitely \emph{inconsistent} with that given here.
Within a Cambridge examination context you may assume
that Shannon--Fano codes are those considered here.}.
Fano was the professor who set the homework for Huffman.
The point of view adopted here means that for some
problems there may be more than one Shannon--Fano
code.)
\begin{exercise}\label{E;Fano}
(i) Let ${\mathcal A}=\{1,2,3,4\}$. Suppose that the
probability
that letter $k$ is chosen is $k/10$.
Use your calculator\footnote{If you have no calculator,
your computer has a calculator program. If you have
no computer, use log tables. If you are on a desert
island, just think.} to find $\lceil -\log_{2} p_{k}\rceil$
and write down an appropriate Shannon--Fano code $c$.
(ii) We found a Huffman code $c_{h}$ for the system
in Example~\ref{E;Huffman do}.
Show\footnote {Unless you are on a desert island
in which case the calculations are rather tedious.}
that the entropy is approximately $1.85$,
that ${\mathbb E}|c(A)|=2.4$
and that ${\mathbb E}|c_{h}(A)|=1.9$.
Check that these results are consistent with
our previous theorems.
\end{exercise}
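For part (i) of the exercise, a few lines of Python replace the calculator (our own sketch, not part of the exercise):

```python
from math import ceil, log2

p = [k / 10 for k in (1, 2, 3, 4)]       # letter k chosen with probability k/10
lengths = [ceil(-log2(pk)) for pk in p]  # Shannon--Fano lengths
H = -sum(pk * log2(pk) for pk in p)      # Shannon entropy
E_len = sum(pk * lk for pk, lk in zip(p, lengths))

assert lengths == [4, 3, 2, 2]
assert abs(H - 1.85) < 0.005     # entropy is approximately 1.85
assert abs(E_len - 2.4) < 1e-9   # E|c(A)| = 2.4 for the Shannon--Fano code
assert E_len <= 1 + H            # consistent with the Shannon--Fano theorem
```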
Putting Theorems~\ref{T;no better} and~\ref{T;Shannon--Fano}
together,
we get the following remarkable result.
\begin{theorem}\label{T;Shannon noiseless}%
{\bf[Shannon's noiseless coding theorem]}
Let ${\mathcal A}$ and ${\mathcal B}$
be finite alphabets and let ${\mathcal B}$ have $D$ symbols. If
$A$ is an ${\mathcal A}$-valued random variable,
then any decodable code $c$ which minimises ${\mathbb E}|c(A)|$
satisfies
\[\frac{H(A)}{\log_{2} D}\leq
{\mathbb E}|c(A)|\leq 1+\frac{H(A)}{\log_{2} D}.\]
\end{theorem}
In particular, Huffman's code $c_{h}$ for two symbols satisfies
\[H(A)\leq
{\mathbb E}|c_{h}(A)|\leq 1+H(A).\]
\begin{exercise} (i) Sketch $h(t)=-t\log t$ for $0\leq t\leq 1$.
(We define $h(0)=0$.)
(ii) Let
\[\Gamma=\left\{{\mathbf p}\in{\mathbb R}^{n}
\,:\,p_{j}\geq 0,\ \sum_{j=1}^{n}p_{j}=1\right\}\]
and let $H:\Gamma\rightarrow{\mathbb R}$ be defined by
\[H({\mathbf p})=\sum_{j=1}^{n}h(p_{j}).\]
Find the maximum and minimum of $H$ and
describe the points where these values are attained.
(iii) If $n=2^{r}+s$ with $0\leq s< 2^{r}$ and $p_{j}=1/n$,
describe the Huffman code $c_{h}$ for two symbols
and verify directly that (with notation
of Theorem~\ref{T;Shannon noiseless})
\[H(A)\leq
{\mathbb E}|c_{h}(A)|\leq 1+H(A).\]
\end{exercise}
Waving our hands about wildly, we may say that
`A system with low Shannon entropy is highly
organised and, knowing the system, it is
usually quite easy to identify an individual from the
system'.
\begin{exercise} The notorious Trinity gang
has just been rounded up and
Trubshaw of the Yard wishes to identify the leader (or Master,
as he is called). Sam the Snitch makes the following offer.
Presented with any collection of members of the gang
he will (by a slight twitch of his left ear)
indicate if the Master is among them. However, in
view of the danger involved, he demands ten pounds
for each such encounter. Trubshaw believes that
the probability of the $j$th member of the gang being
the Master is $p_{j}$ $[1\leq j\leq n]$
and wishes to minimise the expected drain on the public purse.
Advise him.
\end{exercise}
\section{Non-independence}
(This section is non-examinable.)
In the previous sections we discussed codes
$c:{\mathcal A}\rightarrow{\mathcal B}^{*}$
such that, if a letter $A\in{\mathcal A}$ was chosen according to
some random law, ${\mathbb E}|c(A)|$ was about
as small as possible. If we choose $A_{1}$, $A_{2}$,
\dots independently according to the same law,
then it is not hard to convince oneself that
\[{\mathbb E}|c^{*}(A_{1}A_{2}A_{3}\dots A_{n})|
=n{\mathbb E}|c(A)|\]
will be as small as possible.
However, in re*l lif* th* let*ers a*e
often no* i*d*p***ent. It is sometimes possible
to send messages more
efficiently using this fact.
\begin{exercise}\label{E;Markov}
Suppose that we have a sequence
$X_{j}$ of random variables taking the values
$0$ and $1$. Suppose that $X_{1}=1$ with probability $1/2$
and
$X_{j+1}=X_{j}$ with probability
$.99$, independently of what has gone before.
(i) Suppose we wish to send ten successive bits
$X_{j}X_{j+1}\dots X_{j+9}$. Show that if we associate
the sequence of ten zeros with $0$, the sequence
of ten ones with $10$ and any other sequence
$a_{0}a_{1}\dots a_{9}$ with $11a_{0}a_{1}\dots a_{9}$
we have a decodable code which on average requires
about $5/2$ bits to transmit the sequence.
(ii) Suppose we wish to send the bits
$X_{j}X_{j+10^{6}}X_{j+2\times 10^{6}}\dots X_{j+9\times 10^{6}}$.
Explain why any decodable code will require
on average at least $10$ bits to transmit the sequence.
(You need not do detailed computations.)
\end{exercise}
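The expected length in part~(i) can be computed exactly; the following sketch (ours, not part of the exercise) confirms the figure of about $5/2$ bits:

```python
# By symmetry the chain is stationary with P(X_j = 0) = P(X_j = 1) = 1/2.
p_stay = 0.99
p_ten_zeros = 0.5 * p_stay ** 9   # ten equal bits, all 0: codeword '0' (1 bit)
p_ten_ones = 0.5 * p_stay ** 9    # ten equal bits, all 1: codeword '10' (2 bits)
p_other = 1 - p_ten_zeros - p_ten_ones  # '11' then the ten bits (12 bits)

expected = 1 * p_ten_zeros + 2 * p_ten_ones + 12 * p_other
assert abs(expected - 2.5) < 0.1        # about 5/2 bits on average
```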
If we transmit sequences of letters by forming
them into longer words and coding the words,
we say we have a block code. It is plausible that
the longer the blocks, the less important the
effects of non-independence. In more advanced courses
it is shown how to define entropy for systems
like the one discussed in Exercise~\ref{E;Markov}
(that is to say Markov chains) and that, provided
we take long enough blocks, we can recover
an analogue of Theorem~\ref{T;Shannon noiseless}
(the noiseless coding theorem).
In the real world, the problem lies deeper.
Presented with a photograph, we can instantly
see that it represents Lena wearing a hat.
If a machine reads the image pixel by pixel,
it will have great difficulty recognising
much, apart from the fact that the distribution
of pixels is `non-random' or has `low entropy'
(to use the appropriate hand-waving expressions).
Clearly, it ought to be possible to describe the photograph
with many fewer bits than are required to describe
each pixel separately, but, equally clearly,
a method that works well on black and white photographs
may fail on colour photographs and a method that works well
on photographs of faces may work badly
when applied to photographs
of trees.
Engineers have a clever way of dealing with this problem.
Suppose we have a sequence $x_{j}$ of zeros and ones
produced by some random process. Someone who believes that
they partially understand the nature of the process
builds us a prediction machine
which, given the sequence $x_{1}$, $x_{2}$, \dots $x_{j}$ so far,
predicts the next term will be $x_{j+1}'$. Now set
\[y_{j+1}\equiv x_{j+1}-x'_{j+1}\mod 2.\]
If we are given the sequence $y_{1}$, $y_{2}$, \dots
we can recover the $x_{j}$ inductively using the
prediction machine and the formula
\[x_{j+1}\equiv y_{j+1}+x'_{j+1}\mod 2.\]
If the prediction machine is good, then the
sequence of $y_{j}$ will consist mainly of zeros
and there will be many ways of encoding the sequence
as (on average) a much shorter code word.
(For example, if we arrange the sequence in blocks of
fixed length, many of the possible blocks will have very low
probability, so Huffman's algorithm will be very effective.)
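A minimal sketch of the prediction-machine trick (the particular predictor is our own toy choice):

```python
def encode(xs, predict):
    """y_{j+1} = x_{j+1} + x'_{j+1} (mod 2), x'_{j+1} from the predictor."""
    ys, history = [], []
    for x in xs:
        ys.append((x + predict(history)) % 2)
        history.append(x)
    return ys

def decode(ys, predict):
    """Recover x_{j+1} = y_{j+1} + x'_{j+1} (mod 2) inductively."""
    xs = []
    for y in ys:
        xs.append((y + predict(xs)) % 2)
    return xs

def predict_last(history):
    """Toy predictor (our own choice): guess that the last bit repeats."""
    return history[-1] if history else 0

xs = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
ys = encode(xs, predict_last)
assert decode(ys, predict_last) == xs   # the transformation is reversible
assert sum(ys) == 2                     # only the two changes of value survive
```

If the predictor is good, the $y_{j}$ are mostly zeros, as in this run.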
Build a better mousetrap, and the world will beat a path to your door.
Build a better prediction machine and the world will beat your door down.
There is a further real world complication.
Engineers
distinguish between irreversible
`lossy compression'
and reversible `lossless compression'.
For compact discs, where bits are cheap,
the sound recorded can be reconstructed
exactly. For digital sound broadcasting, where
bits are expensive, the engineers make use
of knowledge of the human auditory system
(for example, the fact that we cannot
make out very soft noise in the presence
of loud noises) to produce a result that might
sound perfect (or nearly so) to us, but which
is, in fact, not. For mobile phones, there can be
greater loss of data because users
do not demand anywhere close to
perfection.
For digital TV, the situation is still more
striking with reduction in data content from
film to TV of anything up to a factor of 60.
However, medical and satellite pictures must
be transmitted with no loss of data.
Notice that lossless coding can be judged by
absolute criteria, but the merits of lossy
coding can only be judged subjectively.
Ideally, lossless compression should
lead to a signal indistinguishable (from a statistical
point of view) from a random signal
in which the value of each bit is independent
of the value of all the others.
In practice, this is only possible in certain
applications. As an indication of the kind of
problem involved, consider TV pictures. If
we know that what is going to be transmitted is
`head and shoulders' or `tennis matches' or
`cartoons' it is possible to obtain extraordinary
compression ratios by `tuning' the compression method
to the expected pictures, but then changes from
what is expected can be disastrous. At present,
digital TV encoders merely expect the picture to
consist of blocks which move at nearly constant
velocity remaining more or less unchanged from
frame to frame\footnote{Watch what happens when
things go wrong.}. In this, as in other applications,
we know that after compression
the signal still has non-trivial
statistical properties,
but we do not know
enough about them to exploit them.
\section{What is an error correcting code?}
In the introductory Section~\ref{S;alphabets},
we discussed `telegraph codes' in which
one five letter
combination `QWADR', say, meant `please book
quiet room for two' and another `QWNDR' meant
`please book cheapest room for one'.
Obviously, also, an error of one letter
in this code could have unpleasant
consequences\footnote{This is a made up example,
since compilers of such codes understood the problem.}.
Today, we transmit and store long strings of
binary sequences, but face the same problem
that some digits may not be transmitted or stored correctly.
We suppose that the string is the result of data compression
and so, as we said at the end of the last
section, \emph{although the string
may have non-trivial
statistical properties,
we do not know
enough to exploit this fact}. (If we knew how to
exploit any statistical regularity, we could build a
prediction device and compress the data still further.)
Because of this, we shall assume that
we are asked to consider a collection
of $m$ messages \emph{each of which is equally
likely}.
Our model is the following. When the `source'
produces one of the $m$ possible messages $\mu_{i}$ say,
it is fed into a `coder' which outputs
a string $\mathbf{c}_{i}$ of $n$ binary digits.
The string is then transmitted one digit
at a time along a `communication channel'.
Each digit has probability $p$ of being mistransmitted
(so that $0$ becomes $1$ or $1$ becomes $0$)
independently of what happens to the other digits $[0\leq p<1/2]$.
The transmitted message is then passed through
a `decoder' which either produces a message $\mu_{j}$
(where we hope that $j=i$) or an error message
and passes it on to the `receiver'.
The technical term for our model is the
\emph{binary symmetric channel} (binary because
we use two symbols, symmetric because
the probability of error is the same whichever
symbol we use).
\begin{exercise}\label{Reverse} Why do we not consider
the case $1\geq p>1/2$? What
if $p=1/2$?
\end{exercise}
For most of the time we shall concentrate our attention
on a \emph{code} $C\subseteq\{0,1\}^{n}$ consisting
of the \emph{codewords} $\mathbf{c}_{i}$.
(Thus we use a fixed length code.) We say that
$C$ has \emph{size} $m=|C|$.
If $m$ is large then
we can send a large number of possible messages
(that is to say, we can send more information) but,
as $m$ increases, it becomes harder to distinguish
between different messages when errors occur.
At one extreme, if $m=1$, errors cause us no problems
(since there is only one message) but no information
is transmitted (since there is only one message).
At the other extreme, if $m=2^{n}$, we can transmit
lots of messages but any error moves us from one
codeword to another. We are led to the following
rather natural definition.
\begin{definition}\label{information rate}
The information rate of $C$ is
$\dfrac{\log_{2} m}{n}$.
\end{definition}
Note that, since $m\leq 2^{n}$, the information rate
is never greater than $1$. Notice also that the values of
the information rate when $m=1$ and $m=2^{n}$ agree
with what we might expect.
How should our decoder work? We have assumed that
all messages are equally likely and that errors
are independent (this would not be true if, for
example, errors occurred in bursts\footnote{For
the purposes of this course we note
that this problem could
be tackled by permuting the `bits' of the message
so that `bursts are spread out'. In theory, we
could do better than this by using
the statistical properties of such bursts
to build a prediction machine.
In practice, this is rarely possible.
In the paradigm case of mobile phones,
the properties of the transmission channel
are constantly changing and are not well understood.
(Here the main restriction on the use of permutation
is that it introduces time delays. One way round this
is `frequency hopping' in which several users
constantly swap transmission channels `dividing
bursts among users'.) One desirable property of
codes for mobile phone users is that they should
`fail gracefully', so that as the error rate
for the channel rises the error rate for the receiver
should not suddenly explode.}).
Under these
assumptions, a reasonable strategy for our decoder
is to guess
that the codeword sent is one which differs
in the fewest places from the string of $n$
binary digits received. Here and elsewhere
the discussion can be illuminated by the
simple notion of a Hamming distance.
\begin{definition} If $\mathbf{x},\ \mathbf{y}\in \{0,1\}^{n}$,
we write
\[d(\mathbf{x},\mathbf{y})=\sum_{j=1}^{n}|x_{j}-y_{j}|\]
and call $d(\mathbf{x},\mathbf{y})$ the Hamming distance between
$\mathbf{x}$ and $\mathbf{y}$.
\end{definition}
\begin{lemma} The Hamming distance is a metric.
\end{lemma}
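In code (a sketch, with our own function name):

```python
def hamming(x, y):
    """Hamming distance: the number of places where x and y differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

assert hamming([0, 1, 1, 0], [0, 1, 0, 1]) == 2
# The triangle inequality on a small example:
x, y, z = [0, 0, 0], [1, 1, 0], [1, 0, 1]
assert hamming(x, z) <= hamming(x, y) + hamming(y, z)
```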
We now do some very simple IA probability.
\begin{lemma} We work with the coding and transmission scheme
described above.
Let ${\mathbf c}\in C$ and ${\mathbf x}\in \{0,1\}^{n}$.
(i) If $d({\mathbf c},{\mathbf x})=r$, then
\[\Pr({\mathbf x}\ \text{received given
${\mathbf c}$ sent})=p^{r}(1-p)^{n-r}.\]
(ii) If
$d({\mathbf c},{\mathbf x})=r$, then
\[\Pr({\mathbf c}\ \text{sent given
${\mathbf x}$ received})=A({\mathbf x})p^{r}(1-p)^{n-r},\]
where $A({\mathbf x})$ does not depend on $r$ or ${\mathbf c}$.
(iii) If ${\mathbf c}'\in C$ and
$d({\mathbf c}',{\mathbf x})\geq d({\mathbf c},{\mathbf x})$,
then
\[\Pr({\mathbf c}\ \text{sent given
${\mathbf x}$ received})\geq
\Pr({\mathbf c}'\ \text{sent given
${\mathbf x}$ received}),\]
with equality if and only if
$d({\mathbf c}',{\mathbf x})=d({\mathbf c},{\mathbf x})$.
\end{lemma}
This lemma justifies our use, both explicit
and implicit, throughout what follows of the so-called
\emph{maximum likelihood} decoding rule.
\begin{definition} The maximum likelihood decoding
rule states that a string ${\mathbf x}\in \{0,1\}^{n}$
received by a decoder should be decoded as
(one of) the codeword(s) at the smallest
Hamming distance from ${\mathbf x}$.
\end{definition}
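For small codes the rule can be implemented by a complete search (a sketch; the names are ours):

```python
def ml_decode(x, code):
    """Maximum likelihood decoding: a codeword at the smallest Hamming
    distance from the received string x, found by complete search."""
    return min(code, key=lambda c: sum(a != b for a, b in zip(c, x)))

# The two codewords of the repetition code of length 5:
code = [(0, 0, 0, 0, 0), (1, 1, 1, 1, 1)]
assert ml_decode((0, 1, 0, 0, 1), code) == (0, 0, 0, 0, 0)  # two errors corrected
```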
Notice that, although this decoding rule is mathematically
attractive, it may be impractical if $C$ is large
and there is often no known way of finding the codeword at
the smallest distance from a particular ${\mathbf x}$
in an acceptable number of steps.
(We can always make a complete search through all the
members of $C$ but unless there are very special circumstances
this is likely to involve an unacceptable amount of work.)
\section{Hamming's breakthrough}
Although we have used simple probabilistic arguments
to justify it, the maximum likelihood decoding rule
will often enable us to avoid probabilistic considerations
(though not in the very important part of the course concerned with
Shannon's noisy coding theorem) and concentrate on
algebra and combinatorics.
The spirit of most of the course is exemplified in the
next two definitions.
\begin{definition} We say that $C$ is $d$ \emph{error detecting}
if changing up to $d$ digits in a codeword never produces
another codeword.
\end{definition}
\begin{definition} We say that $C$ is $e$
\emph{error correcting}
if knowing that a string of $n$ binary digits differs
from some codeword of $C$ in at most
$e$ places we can deduce the codeword.
\end{definition}
Here are some simple schemes. Some of them use alphabets with
more than two symbols but the principles remain the same.
\noindent\emph{Repetition coding of length $n$}.
We take codewords of the form
\[{\mathbf c}=(c,c,c,\dots,c)\]
with $c=0$ or $c=1$. The code $C$ is $n-1$ error detecting,
and $\lfloor (n-1)/2\rfloor$ error correcting.
The maximum likelihood decoder chooses the
symbol that occurs most often.
(Here and elsewhere $\lfloor \alpha\rfloor$ is the largest
integer $N\leq\alpha$ and $\lceil \alpha \rceil$ is
the smallest integer $M\geq\alpha$.) Unfortunately,
the information rate is $1/n$ which is
rather low\footnote{Compare
the chorus `Oh no John, no John, no John, no'.}.
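A sketch of the maximum likelihood decoder for the repetition code (our own code, not part of the notes):

```python
from collections import Counter

def majority_decode(received):
    """Maximum likelihood decoding of the repetition code:
    choose the symbol that occurs most often."""
    return Counter(received).most_common(1)[0][0]

# n = 7: up to (n-1)//2 = 3 errors are corrected.
received = [1, 0, 1, 0, 1, 0, 1]     # codeword (1,...,1) with three errors
assert majority_decode(received) == 1
```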
\noindent\emph{The Cambridge examination paper code}.
Each candidate is asked to write down a Candidate Identifier of the
form
$1234A$, $1235B$, $1236C$,\ \dots
(the eleven\footnote{My guess.} possible letters are repeated cyclically)
and a desk number. The first four numbers
in the Candidate Identifier identify the candidate
uniquely. If the letter written by the
candidate does not correspond to
the first four numbers, the candidate is identified
by using the desk number.
\begin{exercise}\label{E;Cambridge exam}
Show that if the candidate makes one error
in the Candidate Identifier,
then this will be detected.
Would this be true if there were $9$ possible
letters repeated cyclically? Would this be true
if there were $12$ possible
letters repeated cyclically?
Give reasons.
Show that, if we also use the desk number,
then the combined Candidate Identifier/desk number code
is one error correcting.
\end{exercise}
\noindent\emph{The paper tape code.}
Here and elsewhere, it is convenient
to give $\{0,1\}$ the structure
of the field
${\mathbb F}_{2}={\mathbb Z}_{2}$
by using arithmetic modulo 2.
The codewords have the form
\[{\mathbf c}=(c_{1},c_{2},c_{3},\dots,c_{n})\]
with $c_{1}$, $c_{2}$, \dots, $c_{n-1}$ freely
chosen elements of ${\mathbb F}_{2}$ and
$c_{n}$ (the check digit) the element of
${\mathbb F}_{2}$ which gives
\[c_{1}+c_{2}+\dots+c_{n-1}+c_{n}=0.\]
The resulting code $C$ is $1$ error detecting
since, if ${\mathbf x}\in {\mathbb F}_{2}^{n}$
is obtained from ${\mathbf c}\in C$
by making a single error, we have
\[x_{1}+x_{2}+\dots+x_{n-1}+x_{n}=1.\]
However, it is not error correcting
since, if
\[x_{1}+x_{2}+\dots+x_{n-1}+x_{n}=1,\]
there are $n$ codewords ${\mathbf y}$ with
Hamming distance $d({\mathbf x},{\mathbf y})=1$.
The information rate is $(n-1)/n$. Traditional
paper tape had 8 places per line each
of which could have a punched hole or not,
so $n=8$.
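A sketch of the check-digit computation and the parity test (names our own):

```python
def add_check_digit(bits):
    """Append c_n so that c_1 + c_2 + ... + c_n = 0 (mod 2)."""
    return bits + [sum(bits) % 2]

def parity_fails(word):
    """The decoder flags any word whose digits sum to 1 (mod 2)."""
    return sum(word) % 2 == 1

line = add_check_digit([1, 0, 1, 1, 0, 1, 0])   # n = 8, as on paper tape
assert not parity_fails(line)    # a codeword passes the check
corrupted = line[:]
corrupted[3] ^= 1                # a single error...
assert parity_fails(corrupted)   # ...is detected
```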
\begin{exercise}\label{ISBN}
If you look
at the inner
title page of almost any book published between
1970 and 2006
you will find its International Standard
Book Number (ISBN). The ISBN
uses single digits selected from 0, 1, \dots, 8, 9
and $X$ representing 10. Each ISBN consists
of nine such digits $a_{1}$, $a_{2}$, \dots, $a_{9}$
followed by a single check digit $a_{10}$ chosen
so that
\begin{equation*}
10a_{1}+9a_{2}+ \dots+2a_{9}+a_{10}\equiv 0\mod{11}.\tag*{(*)}
\end{equation*}
(In more sophisticated language, our code $C$ consists
of those elements ${\mathbf a}\in {\mathbb F}_{11}^{10}$
such that $\sum_{j=1}^{10}(11-j)a_{j}=0$.)
(i) Find a couple of books\footnote{In case of difficulty,
your college library may be of assistance.}
and check that $(*)$ holds for their ISBNs\footnote{In fact,
$X$ is only used in the check digit place.}.
(ii) Show that $(*)$ will not work if you make a mistake
in writing down one digit of an ISBN.
(iii) Show that
$(*)$ may fail to detect two errors.
(iv) Show that $(*)$ will not work if you interchange
two distinct adjacent digits (a transposition error).
(v) Does~(iv) remain true if we replace `adjacent'
by `different'?
\noindent Errors of type (ii) and (iv) are the most common
in typing\footnote{Thus a syllabus for an
earlier version of this
course contained the rather charming misprint
of `snydrome' for `syndrome'.}.
In communication between publishers and booksellers,
both sides are anxious that errors should be detected
but would prefer the other side to query errors
rather than to guess what the error might have been.
(vi) After January 2007, the appropriate ISBN is a $13$ digit number
$x_{1}x_{2}\dots x_{13}$ with each digit
selected from $0$, $1$,\,\dots, $8$, $9$ and
the check digit $x_{13}$ computed by using the formula
\[x_{13}\equiv -(x_{1}+3x_{2}+x_{3}+3x_{4}+\cdots+x_{11}+ 3x_{12})
\mod{10}.\]
Show that we can detect single errors. Give an example
to show that we cannot detect all transpositions.
\end{exercise}
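Parts (ii) and (iv) can be spot-checked mechanically (a sketch; the string used is a checksum-valid illustration, not a claim about any actual book):

```python
def isbn10_valid(s):
    """Check (*): 10 a_1 + 9 a_2 + ... + 2 a_9 + a_10 = 0 (mod 11), X = 10."""
    a = [10 if ch == 'X' else int(ch) for ch in s]
    return sum((10 - i) * d for i, d in enumerate(a)) % 11 == 0

assert isbn10_valid("123456789X")       # checksum-valid illustration
assert not isbn10_valid("123456780X")   # (ii) a single wrong digit is caught
assert not isbn10_valid("213456789X")   # (iv) an adjacent transposition is caught
```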
Hamming had access to an early electronic computer
but was low down in the priority list of users.
He would submit his
programs encoded on paper tape to run over the
weekend but often he would have his tape returned
on Monday because the machine had detected an error
in the tape. `If the machine can detect an error'
he asked himself `why can the machine not correct it?'
and he came up with the following scheme.
\noindent\emph{Hamming's original code.}
We work in ${\mathbb F}_{2}^{7}$. The codewords
${\mathbf c}$ are chosen to satisfy the
three conditions
\begin{align*}
c_{1}+c_{3}+c_{5}+c_{7}&=0\\
c_{2}+c_{3}+c_{6}+c_{7}&=0\\
c_{4}+c_{5}+c_{6}+c_{7}&=0.
\end{align*}
By inspection, we may choose $c_{3}$, $c_{5}$, $c_{6}$
and $c_{7}$ freely and then $c_{1}$, $c_{2}$ and $c_{4}$
are completely determined. The information rate is
thus $4/7$.
Suppose that we receive the string
${\mathbf x}\in{\mathbb F}_{2}^{7}$.
We form the \emph{syndrome}
$(z_{1},z_{2},z_{4})\in{\mathbb F}_{2}^{3}$
given by
\begin{align*}
z_{1}&=x_{1}+x_{3}+x_{5}+x_{7}\\
z_{2}&=x_{2}+x_{3}+x_{6}+x_{7}\\
z_{4}&=x_{4}+x_{5}+x_{6}+x_{7}.
\end{align*}
If ${\mathbf x}$ is a codeword, then
$(z_{1},z_{2},z_{4})=(0,0,0)$.
If ${\mathbf c}$ is a codeword and
the Hamming distance $d({\mathbf x},{\mathbf c})=1$,
then the place in which ${\mathbf x}$ differs
from ${\mathbf c}$ is given by $z_{1}+2z_{2}+4z_{4}$
(using ordinary addition, not addition modulo $2$)
as may be easily checked using linearity
and a case by case study of the seven binary
sequences ${\mathbf x}$ containing one 1
and six 0s. The Hamming code is thus 1
error correcting.
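A sketch of syndrome decoding (our own code; the lists are indexed from $0$, so place $j$ of the text is entry $j-1$):

```python
def syndrome(x):
    """(z1, z2, z4) for Hamming's original code."""
    z1 = (x[0] + x[2] + x[4] + x[6]) % 2
    z2 = (x[1] + x[2] + x[5] + x[6]) % 2
    z4 = (x[3] + x[4] + x[5] + x[6]) % 2
    return z1, z2, z4

def correct(x):
    """Flip the place named by z1 + 2*z2 + 4*z4 (if non-zero)."""
    z1, z2, z4 = syndrome(x)
    pos = z1 + 2 * z2 + 4 * z4
    y = list(x)
    if pos:
        y[pos - 1] ^= 1
    return y

c = [1, 0, 1, 0, 1, 0, 1]         # a codeword: all three conditions hold
assert syndrome(c) == (0, 0, 0)
x = c[:]
x[4] ^= 1                         # one error, in place 5
assert syndrome(x) == (1, 0, 1)   # 1 + 2*0 + 4*1 = 5 names the place
assert correct(x) == c            # the error is located and fixed
```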
\begin{exercise}\label{bacon}
Suppose we use eight hole tape with
the standard paper tape code
and the probability that an error occurs at a particular
place on the tape (i.e. a hole occurs where it should
not or fails to occur where it should) is $10^{-4}$.
A program requires about 10\,000 lines of tape
(each line containing eight places)
using the paper tape code. Using
the Poisson approximation, direct calculation
(possible with a hand calculator but really no
advance on the Poisson method), or otherwise,
show that the probability that the tape
will be accepted as error free by the decoder
is less than .04\%.
Suppose now that we use the Hamming scheme
(making no use of the last place in each line).
Explain why the program requires about
17\,500 lines of tape but that any
particular line will be correctly decoded
with probability about $1-(21\times 10^{-8})$
and the probability that the entire program
will be correctly decoded is better than
99.6\%.
\end{exercise}
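The arithmetic of the exercise can be checked directly (a sketch):

```python
from math import exp

p = 1e-4
# Plain paper tape: all 10000 x 8 places must be correct.
p_tape_ok = (1 - p) ** (10000 * 8)
assert p_tape_ok < 0.0004                 # less than .04%
assert abs(p_tape_ok - exp(-8)) < 1e-4    # Poisson approximation exp(-8)

# Hamming scheme: 7 used places per line carry 4 data bits,
# so 10000 * 7 / 4 = 17500 lines are needed.
p_line_ok = (1 - p) ** 7 + 7 * p * (1 - p) ** 6   # at most one error in a line
assert abs((1 - p_line_ok) - 21e-8) < 1e-9        # about 21 x 10^-8 per line
assert p_line_ok ** 17500 > 0.996                 # better than 99.6%
```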
Hamming's scheme is easy to implement. It
took a little time for his company to
realise what he had done\footnote{Experienced engineers
came away from working demonstrations
muttering `I still don't believe it'.}
but they were soon trying to patent it.
In retrospect, the idea of an error correcting
code seems obvious (Hamming's scheme had
actually been used as the basis of a Victorian
party trick) and indeed two or three other
people discovered it independently, but Hamming
and his
co-discoverers had done more than
find a clever answer to a question. They
had asked an entirely new question and
opened a new field for mathematics and engineering.
The times were propitious for the development
of the new field. Before 1940, error correcting
codes would have been luxuries, solutions
looking for problems, after 1950, with the
rise of the computer and new communication
technologies, they became necessities.
Mathematicians and engineers returning
from wartime duties in
code breaking, code making and
general communications problems were
primed to grasp and extend the ideas.
The mathematical engineer
Claude Shannon may be considered the
presiding genius of the new field.
The reader will observe that data compression
shortens the length of our messages by
removing redundancy and Hamming's scheme
(like all error correcting codes) lengthens
them by introducing redundancy. This is true,
but data compression removes redundancy which we
do not control and which is not useful
to us and error correction coding then
replaces it with carefully controlled
redundancy which we can use.
The reader will also note an analogy with ordinary language.
The idea of data compression is illustrated by the
fact that many common words
are short\footnote{Note how `horseless carriage'
becomes `car' and `telephone' becomes `phone'.}.
On the other hand the redund of ordin lang
makes it poss to understa it even if we do
no catch everyth that is said.
\section{General considerations} How good
can error correcting and error
detecting\footnote{If the error rate is low and it is easy
to ask for the message to be retransmitted, it may be
cheaper to concentrate on error detection. If there is
no possibility of retransmission (as in long term data storage),
we have to concentrate on error correction.}
codes be? The following discussion is a natural
development of the ideas we have already
discussed. Later, in our discussion of Shannon's noisy coding
theorem we shall see another
and deeper way
of looking at the question.
\begin{definition} The minimum distance $d$
of a code is the smallest Hamming distance
between distinct code words.
\end{definition}
We call a code of length $n$, size $m$ and
distance $d$ an $[n,m,d]$ code. Less briefly,
a set $C\subseteq{\mathbb F}_{2}^{n}$,
with $|C|=m$ and
\[\min\{d({\mathbf x},{\mathbf y}):
{\mathbf x},{\mathbf y}\in C,\ {\mathbf x}\neq{\mathbf y}\}
=d\]
is called an $[n,m,d]$ code. By an $[n,m]$ code
we shall simply mean a code of length $n$
and size $m$.
\begin{lemma}\label{minimum distance}
A code of minimum distance
$d$ can detect $d-1$ errors\footnote{This is not
as useful as it looks when $d$ is large. If we know
that our message is likely to contain many
errors, all that an error detecting code
can do is confirm our expectations.
Error detection is only useful when errors are
unlikely.} and correct
$\lfloor\frac{d-1}{2}\rfloor$ errors.
It cannot detect all sets of $d$ errors
and cannot correct all sets of
$\lfloor\frac{d-1}{2}\rfloor+1$ errors.
\end{lemma}
It is natural, here and elsewhere, to make use
of the geometrical insight provided by the
(closed) Hamming ball
\[B({\mathbf x},r)=\{{\mathbf y}:
d({\mathbf x},{\mathbf y})\leq r\}.\]
Observe that
\[|B({\mathbf x},r)|=|B({\boldsymbol 0},r)|\]
for all ${\mathbf x}$ and so, writing
\[V(n,r)=|B({\boldsymbol 0},r)|,\]
we know that $V(n,r)$ is the number of points in any
Hamming ball of radius $r$. A simple counting
argument shows that
\[V(n,r)=\sum_{j=0}^{r}\binom{n}{j}.\]
\begin{theorem}{\bf [Hamming's bound]}\label{Hamming's bound} If a
code $C$ is $e$ error correcting, then
\[|C|\leq \frac{2^{n}}{V(n,e)}.\]
\end{theorem}
There is an obvious fascination (if not utility)
in the search for codes which attain the
exact Hamming bound.
\begin{definition} A code $C$ of length $n$
and size $m$ which can correct $e$ errors
is called \emph{perfect} if
\[m=\frac{2^{n}}{V(n,e)}.\]
\end{definition}
\begin{lemma} Hamming's original code is
a $[7,16,3]$ code. It is perfect.
\end{lemma}
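The perfection of Hamming's code is a one-line arithmetic check (here $n=7$ and $e=1$, so the condition is $m=2^{7}/V(7,1)$):

```python
from math import comb

# Perfection condition m = 2^n / V(n,e) for Hamming's [7,16,3] code (e = 1)
V_7_1 = sum(comb(7, j) for j in range(2))   # 1 + 7 = 8
assert 2**7 // V_7_1 == 16
print("Hamming's [7,16,3] code meets the Hamming bound exactly")
```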
It may be worth remarking in this context
that, if a code which can correct $e$ errors
is perfect (i.e. has a perfect packing
of Hamming balls of radius $e$),
then the decoder must invariably
give the wrong answer when presented with
$e+1$ errors. We note also that,
if (as will usually be the case)
$2^{n}/V(n,e)$ is not an integer, no
perfect $e$ error correcting code can exist.
\begin{exercise}\label{Not perfect}
Even if $2^{n}/V(n,e)$ is an integer,
a perfect code need not exist.
(i) Verify that
\[\frac{2^{90}}{V(90,2)}=2^{78}.\]
(ii) Suppose that $C$ is a perfect 2 error correcting
code of length $90$ and size $2^{78}$. Explain
why we may suppose without loss of generality
that ${\boldsymbol 0}\in C$.
(iii) Let $C$ be as in (ii) with ${\boldsymbol 0}\in C$.
Consider the set
\[X=\{{\mathbf x}\in{\mathbb F}_{2}^{90}:
x_{1}=1,\ x_{2}=1,\ d({\boldsymbol 0},{\mathbf x})=3\}.\]
Show that, corresponding to each ${\mathbf x}\in X$,
we can find a unique ${\mathbf c}({\mathbf x})\in C$
such that $d({\mathbf c}({\mathbf x}),{\mathbf x})=2$.
(iv) Continuing with the argument of (iii), show
that
\[d({\mathbf c}({\mathbf x}),{\boldsymbol 0})=5\]
and that $c_{i}({\mathbf x})=1$ whenever $x_{i}=1$.
If $\mathbf{y}\in X$,
find the number of solutions to the equation
${\mathbf c}({\mathbf x})={\mathbf c}({\mathbf y})$
with $\mathbf{x}\in X$
and, by considering the number of elements of $X$,
obtain a contradiction.
(v) Conclude that there is no perfect $[90,2^{78}]$ code.
\end{exercise}
The result of Exercise~\ref{Not perfect} was obtained by
Golay. Far more importantly, he found another case
when $2^{n}/V(n,e)$ is an integer and there
does exist an associated perfect code (the Golay code).
\begin{exercise}\label{Golay perfect}
Show that $V(23,3)$ is a power of $2$.
\end{exercise}
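The two arithmetic facts underlying Exercise~(i) above and the Golay parameters can be verified directly:

```python
from math import comb

def V(n, r):
    return sum(comb(n, j) for j in range(r + 1))

# Exercise (i) of the non-existence argument: V(90,2) = 4096 = 2^12,
# so 2^90 / V(90,2) = 2^78 exactly.
assert V(90, 2) == 2**12

# The Golay parameters: V(23,3) = 2048 = 2^11, so 2^23 / V(23,3) = 2^12.
assert V(23, 3) == 2**11
print("both sphere-packing counts are exact powers of 2")
```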
Unfortunately, the proof that the Golay code is perfect is
too long to be given in the course.

We obtained the Hamming bound, which
places an upper bound on how good
a code can be, by a \emph{packing} argument.
A \emph{covering} argument gives
us the GSV (Gilbert, Shannon, Varshamov) bound
in the opposite direction. Let us write
$A(n,d)$ for the size of the largest code
with minimum distance $d$.
\begin{theorem} {\bf [Gilbert, Shannon, Varshamov]}\label{GSV}
We have
\[A(n,d)\geq \frac{2^{n}}{V(n,d-1)}.\]
\end{theorem}
Until recently there were no general explicit
constructions for codes which achieved the
GSV bound (i.e. codes of size $m$ and minimum
distance $d$ with $mV(n,d-1)\geq 2^{n}$).
Such a construction
was finally found by Garcia and Stichtenoth by
using `Goppa' codes.
\section{Some elementary probability}
Engineers are, of course, interested in `best codes'
of length $n$ for reasonably small values of $n$,
but mathematicians are particularly interested
in what happens as $n\rightarrow\infty$.
We recall some elementary probability.
\begin{lemma} {\bf [Tchebychev's inequality]}
If $X$ is a bounded real valued random variable
and $a>0$, then
\[\Pr(|X-{\mathbb E}X|\geq a)\leq\frac{\Var X}{a^{2}}.\]
\end{lemma}
\begin{theorem} {\bf [Weak law of large numbers]}
If $X_{1}$, $X_{2}$, \dots is a sequence of independent
identically distributed real valued bounded
random variables and $a>0$, then
\[\Pr\left(\left|n^{-1}\sum_{j=1}^{n}X_{j}
-{\mathbb E}X_{1}\right|\geq a\right)\rightarrow 0\]
as $n\rightarrow\infty$.
\end{theorem}
Applying the weak law of large numbers,
we obtain the following important result.
\begin{lemma} Consider the model of a noisy transmission
channel used in this course in which each digit
has probability $p$ of being wrongly transmitted
independently of what happens to the other digits.
If $\epsilon>0$, then
\begin{small}
\[\Pr\big(\text{number of errors in transmission for
message of $n$ digits}\geq (1+\epsilon)pn\big) \rightarrow 0\]
\end{small}
as $n\rightarrow\infty$.
\end{lemma}
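The lemma can be illustrated without any appeal to the weak law by computing the binomial tail probability exactly; the following sketch does so for an (arbitrarily chosen) error rate $p=0.1$ and $\epsilon=0.5$:

```python
from math import comb, ceil

def tail(n, p, eps):
    """Exact Pr(at least (1+eps)*p*n errors among n independent errors of prob p)."""
    k0 = ceil((1 + eps) * p * n)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

# The tail probability falls towards zero as the message length n grows
for n in (10, 100, 1000):
    print(n, tail(n, 0.1, 0.5))
```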
By
Lemma~\ref{minimum distance},
a code of minimum distance
$d$ can correct
$\lfloor\frac{d-1}{2}\rfloor$ errors.
Thus, if we have an error rate $p$
and $\epsilon>0$, we know that
the probability that a
code of length $n$ with error correcting capacity
$\lceil (1+\epsilon)pn\rceil$
will fail to correct a transmitted message
falls to zero as $n\rightarrow\infty$.
By definition, the biggest code with
minimum distance $\lceil 2(1+\epsilon)pn \rceil$
has size $A(n,\lceil 2(1+\epsilon)pn \rceil)$
and so has information rate
$\log_{2}A(n,\lceil 2(1+\epsilon)pn \rceil)/n$.
Study of the behaviour of $\log_{2}A(n,n\delta)/n$
will thus tell us how large an information
rate is possible in the presence of a given error
rate.
\begin{definition} If $0<\delta<1/2$ we write
\[\alpha(\delta)=\limsup_{n\rightarrow\infty}
\frac{\log_{2} A(n,n\delta)}{n}.\]
\end{definition}
\begin{definition} We define the entropy
function $H:[0,1]\rightarrow{\mathbb R}$
by $H(0)=H(1)=0$ and
\[H(t)=-t\log_{2}t-(1-t)\log_{2}(1-t)
\quad\text{for $0<t<1$}.\]
\end{definition}
\begin{exercise} (i) We have already met
Shannon entropy in Definition~\ref{D;Shannon entropy}.
Give a simple system such that, using the notation
of that definition,
\[H(A)=H(t).\]
(ii) Sketch $H$. What is the value of $H(1/2)$?
\end{exercise}
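The reader who prefers a numerical sketch to a hand-drawn one can evaluate $H$ directly (the function below adopts the convention $H(0)=H(1)=0$ from the definition):

```python
from math import log2

def H(t):
    """The entropy function, with H(0) = H(1) = 0 by definition."""
    if t in (0, 1):
        return 0.0
    return -t * log2(t) - (1 - t) * log2(1 - t)

print(H(0.5))            # 1.0, the maximum value
print(round(H(0.1), 4))  # the curve is symmetric about t = 1/2
```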
\begin{theorem}\label{T;alpha}
With the definitions just given,
\[1-H(\delta)\leq\alpha(\delta)\leq 1-H(\delta/2)\]
for all $0<\delta<1/2$.
\end{theorem}
Using the Hamming bound (Theorem~\ref{Hamming's bound})
and the GSV bound (Theorem~\ref{GSV}), we see that
Theorem~\ref{T;alpha} follows at once from the
following result.
\begin{theorem}\label{log volume} We have
\[\frac{\log_{2}V(n,n\delta)}{n}
\rightarrow H(\delta)\]
as $n\rightarrow\infty$.
\end{theorem}
Our proof of Theorem~\ref{log volume} depends, as
one might expect, on a version of Stirling's formula.
We only need the very simplest version proved
in~IA.
\begin{lemma}[Stirling] We have
\[\log_{e} n!=n\log_{e}n-n+O(\log n).\]
\end{lemma}
We combine this with the remarks that
\[V(n,n\delta)=\sum_{0\leq j\leq n\delta}
\binom{n}{j}\]
and that very simple estimates give
\[\binom{n}{m}\leq \sum_{0\leq j\leq n\delta}
\binom{n}{j}
\leq (m+1)\binom{n}{m}\]
where $m=\lfloor n\delta\rfloor$.
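The convergence asserted in Theorem~\ref{log volume} is slow but visible numerically. The following sketch (with $\delta=0.1$ chosen arbitrarily) computes $n^{-1}\log_{2}V(n,n\delta)$ for increasing $n$:

```python
from math import comb, log2, floor

def H(t):
    return 0.0 if t in (0, 1) else -t * log2(t) - (1 - t) * log2(1 - t)

def normalised_log_volume(n, delta):
    """Compute log_2 V(n, n*delta) / n by direct summation."""
    V = sum(comb(n, j) for j in range(floor(n * delta) + 1))
    return log2(V) / n

# The values creep up towards H(0.1), which is about 0.469
for n in (50, 500, 5000):
    print(n, round(normalised_log_volume(n, 0.1), 4))
```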
Although the GSV bound is very important,
Shannon showed that
a stronger result can be obtained for the
error correcting power of the best long codes.
\section{Shannon's noisy coding theorem}
In the backstreets of Cambridge
(Massachusetts) there is a science museum devoted to
the glory of MIT. Since MIT has a great deal of glory
and since much thought has gone into the presentation
of the exhibits, it is well worth a visit. However,
for any mathematician, the highlight is a glass case
containing such things as a juggling machine,
an electronic calculator\footnote{THROBAC
the THrifty ROman numeral BAckwards-looking Computer.
Google `MIT Museum', go to `objects' and then
search `Shannon'.}
that uses Roman numerals both externally and internally,
the remnants of a machine built to guess which
of heads and tails its opponent would choose
next\footnote{That is to say a prediction machine.
Google `Shannon Mind-Reading Machine' for sites
giving demonstrations and descriptions of the underlying program.}
and a mechanical maze running mouse.
These objects were built by Claude Shannon.
In his 1937 master's thesis, Shannon showed how to
analyse circuits using Boolean algebra and binary arithmetic.
During the war he worked on gunnery control
and cryptography at Bell labs and in 1948 he published
\emph{A Mathematical Theory of Communication}%
\footnote{This beautiful paper is available on the web
and in his \emph{Collected Works}.}.
Shannon had several predecessors and many successors,
but it is his vision which underlies this course.
Hamming's bound together with
Theorem~\ref{T;alpha}
gives a very strong hint that it is not possible to
have an information rate greater than
$1-H(\delta)$ for an error rate $\delta<1/2$.
(We shall prove this explicitly in Theorem~\ref{T;Shannon converse}.)
On the other hand the GSV bound
together with
Theorem~\ref{T;alpha}
shows that it is always possible to
have an information rate greater than
$1-H(2\delta)$ for an error rate $\delta<1/4$.
Although we can use repetition codes
to get a positive information rate
when $1/4\leq \delta<1/2$ it looks
very hard at first (and indeed second) glance
to improve these results.
However, Shannon realised that
we do not care whether errors arise because
of noise in transmission or imperfections
in our coding scheme. By allowing our coding
scheme to be less than perfect
(in this connection, see Question~\ref{E;random ball})
we can
actually improve the information rate whilst
still keeping the error rate low.
\begin{theorem}{\bf [Shannon's noisy coding theorem]}\label{T;Shannon}
Suppose $0<p<1/2$ and $\eta>0$.
Then there exists an $n_{0}(p,\eta)$ such that,
for any $n>n_{0}$, we can find codes of
length $n$ which have the property
that (under our standard model
of a symmetric binary
channel with probability of error $p$) the probability
that any codeword is mistaken is less than
$\eta$ and still have
information rate $1-H(p)-\eta$.
\end{theorem}
Shannon's theorem is a masterly display
of the power of elementary probabilistic
arguments to overcome problems
which appear insuperable by other
means\footnote{Conway says that in order to
achieve success in a mathematical field
you must either be first or be clever.
However, as in the case of Shannon,
most of those who are first
to recognise a new mathematical field
are also clever.}.
However, it merely asserts that good codes
exist and gives no means of finding
them apart from exhaustive search.
More seriously, random codes will have
no useful structure and the only way to use them
is to `search through a large dictionary'
at the coding end and
`search through an enormous dictionary'
at the decoding end.
It should also be noted that $n_{0}(p,\eta)$
will be very large when $p$ is close to $1/2$.
\begin{exercise} Why, in the absence of suitable
structure,
is the dictionary at the decoding
end much larger than the dictionary at the
coding end?
\end{exercise}
It is relatively simple to obtain a converse
to Shannon's theorem.
\begin{theorem}\label{T;Shannon converse}
Suppose $0<p<1/2$ and $\eta>0$.
Then there exists an $n_{0}(p,\eta)$ such that,
for any $n>n_{0}$, it is impossible
to find codes of
length $n$ which have the property
that (under our standard model
of a symmetric binary
channel with probability of error $p$) the probability
that any codeword is mistaken is less than
$1/2$ and the code has
information rate $1-H(p)+\eta$.
\end{theorem}
As might be expected, Shannon's theorem
and its converse extend
to more general noisy channels (in particular, those
where the noise is governed by a Markov chain $M$).
It is possible to define the entropy $H(M)$
associated with $M$
and to show that the information
rate cannot exceed $1-H(M)$ but that any
information rate lower than $1-H(M)$ can
be attained with arbitrarily low error rates.
However, we must leave something for more
advanced courses, and as we said earlier,
it is rare in practice to have very clear
information about the nature of the noise
we encounter.
There is one very important theorem of Shannon
which is not covered in this course. In it, he reinterprets
a result of Whittaker to show that any continuous signal
whose Fourier transform vanishes outside a range of length
$R$ can be
reconstructed from its value at equally spaced
sampling points provided those points are less than $A/R$
apart. (The constant $A$ depends on the conventions used
in defining the Fourier transform.) This enables us
to apply the `digital' theory of information transmission
developed here to continuous signals.
\section{A holiday at the race track}\label{S;race track}
Although this section is examinable\footnote{When the
author of the present notes gives the course.
This is his interpretation of the sentence in the schedules
`Applications to gambling and the stock market.'
Other lecturers may view matters differently.},
the material is peripheral to the course.
Suppose a very rich friend makes you the following
offer. Every day, at noon, you may make a bet with her
for any amount $k$ you choose. You give her $k$
pounds which she keeps whatever happens.
She then tosses a coin
and, if it shows heads, she pays you $ku$
and, if it shows tails, she pays you nothing.
You know that the probability
of heads is $p$.
What should you do?
If $pu<1$, you should not bet, because your expected winnings are
negative. If $pu>1$, most mathematicians would
be inclined to bet, but how much? If you bet your entire fortune
and win, you will be better off than if you bet
a smaller sum, but, if you lose, then you are bankrupt
and cannot continue playing.
Thus your problem is to discover
the proportion $w$ of your present fortune
that you should bet. Observe that your choice of $w$
will always be the same (since
you expect to go on playing for ever). Only the size of your
fortune will vary. If your fortune after $n$ goes
is $Z_{n}$, then
\[Z_{n+1}=Z_{n}Y_{n+1}\]
where $Y_{n+1}=uw+(1-w)$ if the $(n+1)$st throw is heads
and $Y_{n+1}=1-w$ if it is tails.
Using the weak law of large numbers, we have the following
result.
\begin{lemma} Suppose $Y$, $Y_{1}$, $Y_{2}$, \dots are identically
distributed
independent random variables taking values in $[a,b]$ with
$0<a\leq b$. If $\epsilon>0$, then
\[\Pr\left(\left|n^{-1}\sum_{j=1}^{n}\log Y_{j}
-{\mathbb E}\log Y\right|>\epsilon\right)\rightarrow 0\]
as $n\rightarrow\infty$.
\end{lemma}
Thus you should choose $w$ to maximise
\[{\mathbb E}\log Y_{n}=p\log\big(uw+(1-w)\big)+(1-p)\log(1-w).\]
\begin{exercise}\label{E;fast Kelly}
(i) Show that, for the situation described,
you should not bet if $up\leq 1$ and should take
\[w=\frac{up-1}{u-1}\]
if $up>1$.
(ii) We write $q=1-p$. Show that, if $up>1$ and we choose
the optimum $w$,
\[{\mathbb E}\log Y_{n}=p\log p+q\log q+\log u-q\log(u-1).\]
\end{exercise}
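Both parts of Exercise~\ref{E;fast Kelly} can be checked numerically. The sketch below (for the arbitrary choice $u=3$, $p=1/2$) confirms that the closed form of part~(ii) agrees with direct evaluation, and that the stationary point really is a maximum:

```python
from math import log

def growth(w, u, p):
    """Expected log-growth E log Y for betting fraction w."""
    return p * log(u * w + (1 - w)) + (1 - p) * log(1 - w)

u, p = 3.0, 0.5
q = 1 - p
w_opt = (u * p - 1) / (u - 1)                 # here (1.5 - 1)/2 = 0.25
closed_form = p*log(p) + q*log(q) + log(u) - q*log(u - 1)

assert abs(growth(w_opt, u, p) - closed_form) < 1e-12
# w_opt is a maximum: nudging w either way reduces the growth rate
assert growth(w_opt, u, p) > growth(w_opt + 0.01, u, p)
assert growth(w_opt, u, p) > growth(w_opt - 0.01, u, p)
print(w_opt, growth(w_opt, u, p))
```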
We have seen the expression $-(p\log p+q\log q)$ before
as (a multiple of)
the Shannon information entropy of a simple probabilistic system.
In a paper entitled
\emph{A New Interpretation of Information Rate}\footnote{Available
on the web. The exposition is slightly opaque because
the Bell company which employed Kelly was anxious
not to draw attention to the use of telephones for
betting fraud.}
Kelly showed how to interpret this and similar
situations using communication theory. In his model
a gambler receives information over a noisy channel
about which horse is going to win. Just as Shannon's theorem
shows that information can be transmitted over
such a channel at a rate close to channel capacity
with negligible risk of error (provided the
messages are long enough), so that the gambler can
(with arbitrarily high probability)
increase her fortune at a certain optimum rate
provided that she can continue to bet long enough.
Although the analogy between betting and communication
channels is very pretty, it was the suggestion that
those making a long sequence of bets should aim to
maximise the expectation of the logarithm
(now called Kelly's criterion) which
made the paper famous. Although Kelly
seems never to have used his idea in practice,
mathematicians like Thorp, Berlekamp and Shannon himself
have made substantial fortunes in the stock market
and claim to have used Kelly's ideas\footnote{However,
we hear more about mathematicians who win on the
stock market than those who lose.}.
Kelly is also famous for an early demonstration of
speech synthesis in which a computer sang
`Daisy Bell'. This inspired the corresponding
scene in the film \emph{2001}.
Before rushing out to the race track or stock
exchange\footnote{A sprat which thinks it's a shark
will have a very short life.},
the reader
is invited to run computer simulations of the result
of Kelly gambling for various values of $u$ and $p$.
She will observe that although, \emph{in the very long run},
the system works, the short run can be very
unpleasant indeed.
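Here is one such simulation (the parameters $u=3$, $p=1/2$ and the betting fractions tried are arbitrary choices; the Kelly fraction for these values is $1/4$):

```python
import random

def kelly_run(u, p, w, days, seed=0):
    """Simulate a gambler who bets a fixed fraction w of her fortune each day."""
    rng = random.Random(seed)
    fortune = 1.0
    for _ in range(days):
        stake = w * fortune
        fortune -= stake                  # the stake is handed over...
        if rng.random() < p:
            fortune += u * stake          # ...and returns u-fold on heads
    return fortune

u, p = 3.0, 0.5                           # optimal w = (up-1)/(u-1) = 0.25
for w in (0.1, 0.25, 0.6):
    print(w, kelly_run(u, p, w, 1000))
```

Over-betting ($w=0.6$ here) is ruinous in the long run even though each individual bet has positive expectation.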
\begin{exercise}\label{E;Slow Kelly}
Returning to our original problem, show that,
if you bet less than the optimal proportion, your fortune
will still tend to increase but more slowly, but, if you bet
more than some proportion $w_{1}$, your fortune will decrease.
Write down the equation for $w_{1}$.
$[$Moral: If you use the Kelly criterion, veer on the side of under-betting.$]$
\end{exercise}
\section{Linear codes} The next few sections involve no
probability at all. We shall only be interested in constructing
codes which are easy to handle and have all their code words
at least a certain Hamming distance apart.
Just as ${\mathbb R}^{n}$ is a
vector space over ${\mathbb R}$ and
${\mathbb C}^{n}$ is a
vector space over ${\mathbb C}$, so
${\mathbb F}_{2}^{n}$ is a
vector space over ${\mathbb F}_{2}$.
(If you know about vector spaces over fields,
so much the better; if not, just follow
the obvious paths.)
A \emph{linear code} is a subspace of ${\mathbb F}_{2}^{n}$.
More formally, we have the following definition.
\begin{definition} A linear code is a subset of
${\mathbb F}_{2}^{n}$ such that
(i) ${\boldsymbol 0}\in C$,
(ii) if ${\mathbf x},{\mathbf y}\in C$,
then ${\mathbf x}+{\mathbf y}\in C$.
\end{definition}
Note that, if $\lambda\in{\mathbb F}_{2}$, then $\lambda=0$
or $\lambda=1$, so that condition (i) of the definition
just given guarantees that $\lambda{\mathbf x}\in C$
whenever ${\mathbf x}\in C$. We shall see that
linear codes have many useful properties.
\begin{example}
(i) The repetition code with
\[C=\{{\mathbf x}:{\mathbf x}=(x,x,\dots,x)\}\]
is a linear code.
(ii) The paper tape code
\[C=\left\{{\mathbf x}:\sum_{j=1}^{n}x_{j}=0\right\}\]
is a linear code.
(iii) Hamming's original code is a linear code.
\end{example}
The verification is easy. In fact,
examples (ii) and (iii) are `parity check
codes' and so automatically linear
as we see from the next lemma.
\begin{definition} Consider a set $P$ in
${\mathbb F}_{2}^{n}$. We say that $C$
is the code defined by the set of
\emph{parity checks} $P$ if
the elements of $C$ are precisely
those ${\mathbf x}\in{\mathbb F}_{2}^{n}$
with
\[\sum_{j=1}^{n}p_{j}x_{j}=0\]
for all ${\mathbf p}\in P$.
\end{definition}
\begin{lemma} If $C$ is a code defined
by parity checks, then $C$ is linear.
\end{lemma}
We now prove the converse result.
\begin{definition}
If $C$ is a linear code, we write
$C^{\perp}$ for the set of ${\mathbf p}\in{\mathbb F}_{2}^{n}$
such that
\[\sum_{j=1}^{n}p_{j}x_{j}=0\]
for all ${\mathbf x}\in C$.
\end{definition}
\noindent
Thus $C^{\perp}$ is the set of parity
checks satisfied by $C$.
\begin{lemma}\label{anhilate}
If $C$ is a linear code, then
(i) $C^{\perp}$ is a linear code,
(ii) $(C^{\perp})^{\perp}\supseteq C$.
\end{lemma}
\noindent
We call $C^{\perp}$ the \emph{dual code} to $C$.
In the language of
the course on linear mathematics,
$C^{\perp}$ is the annihilator of $C$.
The following is a standard theorem of
that course.
\begin{lemma}\label{dimension anhilator}
If $C$ is a linear code in
${\mathbb F}_{2}^{n}$ then
\[\dim C+\dim C^{\perp}=n.\]
\end{lemma}
Since the treatment of dual spaces is not the most
popular piece of mathematics in IB, we
shall give an independent proof later
(see the note after Lemma~\ref{kernel}).
Combining Lemma~\ref{anhilate}~(ii)
with Lemma~\ref{dimension anhilator},
we get the following corollaries.
\begin{lemma} If $C$ is a linear code, then
$(C^{\perp})^{\perp}=C$.
\end{lemma}
\begin{lemma}\label{parity}
Every linear code is defined by parity
checks.
\end{lemma}
Our treatment of linear codes has
been rather abstract. In order
to put computational flesh on
the dry theoretical bones,
we introduce the notion of
a generator matrix.
\begin{definition} If $C$ is a linear
code of length $n$, any $r\times n$
matrix whose rows form a basis
for $C$ is called a \emph{generator matrix}
for $C$. We say that $C$ has \emph{dimension}
or \emph{rank} $r$.
\end{definition}
\begin{example} As examples, we can find
generator matrices for the repetition code,
the paper tape code and the original Hamming code.
\end{example}
Remember that the Hamming code is the
code of length 7 given by the parity conditions
\begin{align*}
x_{1}+x_{3}+x_{5}+x_{7}&=0\\
x_{2}+x_{3}+x_{6}+x_{7}&=0\\
x_{4}+x_{5}+x_{6}+x_{7}&=0.
\end{align*}
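A brute-force enumeration over all of ${\mathbb F}_{2}^{7}$ confirms that these three parity conditions produce a $[7,16,3]$ code:

```python
from itertools import product

def in_hamming(x):
    """The three parity conditions above (x is 0-indexed, so x[0] is x_1)."""
    return ((x[0] + x[2] + x[4] + x[6]) % 2 == 0 and
            (x[1] + x[2] + x[5] + x[6]) % 2 == 0 and
            (x[3] + x[4] + x[5] + x[6]) % 2 == 0)

code = [x for x in product((0, 1), repeat=7) if in_hamming(x)]
min_d = min(sum(a != b for a, b in zip(x, y))
            for x in code for y in code if x != y)
print(len(code), min_d)   # 16 codewords, minimum distance 3
```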
By using row operations and column permutations
to perform Gaussian elimination, we
can give a constructive proof of
the following lemma.
\begin{lemma} Any linear code of length $n$ has
(possibly after permuting the order of coordinates)
a generator matrix of the form
\[(I_{r}|B).\]
\end{lemma}
Notice that this means that any codeword ${\mathbf x}$
can be written as
\[({\mathbf y}|{\mathbf z})=({\mathbf y}|{\mathbf y}B)\]
where ${\mathbf y}=(y_{1},y_{2},\dots,y_{r})$ may
be considered as the message and the vector
${\mathbf z}={\mathbf y}B$ of length $n-r$ may
be considered the check digits. Any code whose codewords
can be split up in this manner is called \emph{systematic}.
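Systematic encoding is then a single matrix multiplication over ${\mathbb F}_{2}$. The sketch below uses one particular (illustrative) choice of $4\times 3$ matrix $B$, giving a systematic $[7,4]$ code:

```python
# Illustrative systematic encoding y -> (y | yB) over F_2, for one
# particular (hypothetical) choice of the 4x3 matrix B.
B = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1],
     [1, 1, 1]]

def encode(y):
    """Append the check digits z = yB to the message digits y."""
    z = [sum(y[i] * B[i][j] for i in range(len(y))) % 2 for j in range(3)]
    return y + z

print(encode([1, 0, 1, 1]))   # [1, 0, 1, 1, 0, 1, 0]
```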
We now give a more computational treatment of parity checks.
\begin{lemma}\label{kernel} If $C$ is a linear code
of length $n$
with generator matrix $G$, then ${\mathbf a}\in C^{\perp}$
if and only if
\[G{\mathbf a}^{T}={\boldsymbol 0}^{T}.\]
Thus
\[C^{\perp}=(\ker G)^{T}.\]
\end{lemma}
\noindent
Using the rank-nullity theorem,
we get a second proof of Lemma~\ref{dimension anhilator}.
Lemma~\ref{kernel} enables us to characterise
$C^{\perp}$.
\begin{lemma} If $C$ is a linear code of length $n$
and dimension $r$ with $r\times n$ generator
matrix $G$, and $H$ is any $n\times(n-r)$
matrix whose columns form a basis of $\ker G$, then
we know that $H$ is a parity check
matrix for $C$ and its transpose $H^{T}$ is a
generator for $C^{\perp}$.
\end{lemma}
\begin{example} (i) The dual of the paper tape
code is the repetition code.
(ii) Hamming's original code has dual with generator
matrix
\[\begin{pmatrix}
1&0&1&0&1&0&1\\
0&1&1&0&0&1&1\\
0&0&0&1&1&1&1
\end{pmatrix}.\]
\end{example}
We saw above that the codewords of a linear code
can be written
\[({\mathbf y}|{\mathbf z})=({\mathbf y}|{\mathbf y}B)\]
where ${\mathbf y}$ may be considered as the vector
of message
digits and ${\mathbf z}={\mathbf y}B$ as the vector
of check digits. Thus \emph{encoders} for linear
codes are easy to construct.
What about decoders? Recall that every linear code
of length $n$ has
a (non-unique) associated parity check matrix $H$
with the property that ${\mathbf x}\in C$ if and only
if ${\mathbf x}H={\boldsymbol 0}$. If
${\mathbf z}\in{\mathbb F}_{2}^{n}$,
we define the \emph{syndrome}
of ${\mathbf z}$ to be ${\mathbf z}H$.
The following lemma is mathematically trivial
but forms the basis of the method of
\emph{syndrome decoding}.
\begin{lemma} Let $C$ be a linear code with
parity check matrix $H$.
If we are given
${\mathbf z}={\mathbf x}+{\mathbf e}$
where ${\mathbf x}$ is a code word
and the `error vector'
${\mathbf e}\in{\mathbb F}_{2}^{n}$,
then
\[{\mathbf z}H={\mathbf e}H.\]
\end{lemma}
Suppose we have tabulated the syndrome ${\mathbf u}H$
for all ${\mathbf u}$ with `few' non-zero entries
(say, all ${\mathbf u}$ with
$d({\mathbf u},{\boldsymbol 0})\leq K$).
When our decoder receives ${\mathbf z}$, it computes the
syndrome ${\mathbf z}H$. If the syndrome is zero,
then ${\mathbf z}\in C$ and the decoder assumes
the transmitted message was ${\mathbf z}$. If
the syndrome of the received message is a non-zero
vector ${\mathbf w}$, the decoder searches its list
until it finds an ${\mathbf e}$ with
${\mathbf e}H={\mathbf w}$. The decoder then
assumes that the transmitted message was
${\mathbf x}={\mathbf z}-{\mathbf e}$
(note that ${\mathbf z}-{\mathbf e}$ will
always be a codeword, even if not the right one).
This procedure will fail if ${\mathbf w}$
does not appear in the list, but, for this to
be the case, at least $K+1$
errors must have occurred.
If we take $K=1$, that is we only want a 1 error
correcting code, then, writing ${\mathbf e}^{(i)}$
for the vector in ${\mathbb F}_{2}^{n}$ with
$1$ in the $i$th place and $0$ elsewhere, we
see that the syndrome ${\mathbf e}^{(i)}H$
is the $i$th row of $H$. If the transmitted
message ${\mathbf z}$ has syndrome
${\mathbf z}H$ equal to the $i$th row of $H$,
then the decoder assumes that there has been an error
in the $i$th place and nowhere else.
(Recall the special case of Hamming's original code.)
If $K$ is large the task of searching the list
of possible syndromes becomes onerous and,
unless (as sometimes happens) we can find
another trick,
we find that `decoding becomes dear'
although `encoding remains cheap'.
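For $K=1$ the whole procedure fits in a few lines. The sketch below decodes the length-7 Hamming code: the rows of the $7\times 3$ parity check matrix $H$ are the non-zero vectors of ${\mathbb F}_{2}^{3}$, so a single error in the $i$th place produces the $i$th row as syndrome:

```python
# Single-error syndrome decoding (K = 1) for the length-7 Hamming code.
H = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1],
     [1, 0, 1], [0, 1, 1], [1, 1, 1]]

def syndrome(z):
    return [sum(z[i] * H[i][j] for i in range(7)) % 2 for j in range(3)]

def decode(z):
    s = syndrome(z)
    if s == [0, 0, 0]:
        return z                          # assume no errors occurred
    i = H.index(s)                        # the syndrome is the i-th row of H
    return z[:i] + [1 - z[i]] + z[i + 1:] # so flip the i-th digit

c = [1, 0, 1, 1, 0, 1, 0]                 # a codeword: its syndrome is zero
corrupted = c[:4] + [1 - c[4]] + c[5:]    # one error in the fifth place
print(decode(corrupted) == c)             # True
```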
We conclude this section by looking at weights
and the
weight enumeration polynomial for a linear code.
The idea here is to exploit the fact that, if
$C$ is a linear code and ${\mathbf a}\in C$,
then ${\mathbf a}+C=C$. Thus the `view of $C$'
from any codeword ${\mathbf a}$ is the same
as the `view of $C$' from the particular codeword
${\boldsymbol 0}$.
\begin{definition} The \emph{weight} $w({\mathbf x})$
of a vector ${\mathbf x}\in{\mathbb F}_{2}^{n}$
is given by
\[w({\mathbf x})=d({\boldsymbol 0},{\mathbf x}).\]
\end{definition}
\begin{lemma} If $w$ is the weight function on
${\mathbb F}_{2}^{n}$ and
${\mathbf x},\,{\mathbf y}\in{\mathbb F}_{2}^{n}$,
then
(i) $w({\mathbf x})\geq 0$,
(ii) $w({\mathbf x})=0$ if and only if
${\mathbf x}={\boldsymbol 0}$,
(iii) $w({\mathbf x})+w({\mathbf y})\geq
w({\mathbf x}+{\mathbf y})$.
\end{lemma}
Since the minimum (non-zero) weight in a linear code
is the same as the minimum (non-zero)
distance, we can talk
about linear codes of minimum weight $d$
when we mean linear codes of minimum distance $d$.
The pattern of distances in a linear code
is encapsulated in the
weight enumeration polynomial.
\begin{definition}\label{enumeration}
Let $C$ be a linear code of length $n$.
We write $A_{j}$ for the number of codewords
of weight $j$ and define the
\emph{weight enumeration polynomial} $W_{C}$
to be the polynomial in two real variables
given by
\[W_{C}(s,t)=\sum_{j=0}^{n}A_{j}s^{j}t^{n-j}.\]
\end{definition}
Here are some simple properties of $W_{C}$.
\begin{lemma} Under the assumptions and
with the notation of Definition~\ref{enumeration},
the following results are true.
(i) $W_{C}$ is a homogeneous polynomial of degree $n$.
(ii) If $C$ has rank $r$, then $W_{C}(1,1)=2^{r}$.
(iii) $W_{C}(0,1)=1$.
(iv) $W_{C}(1,0)$ takes the value $0$ or $1$.
(v) $W_{C}(s,t)=W_{C}(t,s)$ for all $s$ and $t$
if and only if $W_{C}(1,0)=1$.
\end{lemma}
\begin{lemma} For our standard model of communication
along an error prone channel with independent errors
of probability $p$ and a linear code $C$ of length $n$,
\[W_{C}(p,1-p)=\Pr(\text{receive a code word}\ |
\ \text{code word transmitted})\]
and
\[\Pr(\text{receive incorrect code word}\ |
\ \text{code word transmitted})=W_{C}(p,1-p)-(1-p)^{n}.\]
\end{lemma}
\begin{example}\label{Example William}
(i) If $C$ is the repetition code,
$W_{C}(s,t)=s^{n}+t^{n}$.
(ii) If $C$ is the paper tape code of length $n$,
$W_{C}(s,t)=\frac{1}{2}((s+t)^{n}+(t-s)^{n})$.
\end{example}
Example~\ref{Example William} is a special case of the
MacWilliams identity.
\begin{theorem}{\bf [MacWilliams identity]} If $C$ is a linear
code, then
\[W_{C^{\perp}}(s,t)=2^{-\dim C}W_{C}(t-s,t+s).\]
\end{theorem}
\noindent
We give a proof as Exercise~\ref{E;MacWilliams}.
(The result is thus not bookwork though it
could be set as a problem with appropriate hints.)
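The identity can at least be spot-checked on the dual pair of Example~\ref{Example William}: the dual of the length-$n$ repetition code is the paper tape code, and $\dim C=1$ for the repetition code. The sample evaluation points below are arbitrary:

```python
# Spot check of the MacWilliams identity on a known dual pair.
def W_repetition(s, t, n):
    return s**n + t**n                      # dim = 1

def W_paper_tape(s, t, n):
    return ((s + t)**n + (t - s)**n) / 2    # dim = n - 1

n = 6
for s, t in [(0.3, 0.7), (1.0, 2.0), (0.25, 0.5)]:
    lhs = W_paper_tape(s, t, n)
    rhs = W_repetition(t - s, t + s, n) / 2**1
    assert abs(lhs - rhs) < 1e-9
print("identity holds at the sample points")
```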
\section{Some general constructions} However
interesting the theoretical study of codes may be
to a pure mathematician,
the engineer would prefer to have an arsenal
of practical codes so that she can select
the one most suitable for the job in hand.
In this section we discuss the general Hamming
codes and the Reed-Muller codes as well as some
simple methods of obtaining new codes from
old.
\begin{definition} Let $d$ be a strictly
positive integer and let $n=2^{d}-1$.
Consider the (column) vector
space $D={\mathbb F}_{2}^{d}$. Write down
a $d\times n$ matrix $H$ whose columns
are the $2^{d}-1$ distinct non-zero vectors
of $D$. The Hamming $(n,n-d)$ code is
the linear code of length $n$ with $H^{T}$
as parity check matrix.
\end{definition}
Of course the Hamming $(n,n-d)$ code is only
defined up to permutation of coordinates.
We note that $H$ has rank $d$, so a simple
use of the rank-nullity theorem shows
that our notation is consistent.
\begin{lemma} The Hamming $(n,n-d)$ code is
a linear code of length $n$ and rank $n-d$ $[n=2^{d}-1]$.
\end{lemma}
\begin{example} The Hamming $(7,4)$ code
is the original Hamming code.
\end{example}
The fact that any two columns of $H$ are linearly
independent and a look at the appropriate syndromes
gives us the main property of the general Hamming code.
\begin{lemma} The Hamming $(n,n-d)$ code
has minimum weight $3$ and is a perfect
1 error correcting code $[n=2^{d}-1]$.
\end{lemma}
Hamming codes are ideal in situations where
very long strings of binary digits must be transmitted
but the chance of an error in any individual
digit is very small. (Look at Exercise~\ref{bacon}.)
Although the search for perfect codes other than the Hamming codes
produced
the Golay code (not discussed here) and much interesting
combinatorics, the reader is warned that, from a practical
point of view, it represents a dead
end\footnote{If we confine ourselves to the binary
codes discussed in this course, it is known
that perfect codes of length $n$ with Hamming spheres of radius
$\rho$ exist for $\rho=0$, $\rho=n$, $\rho=(n-1)/2$,
with $n$ odd (the three codes just mentioned are
easy to identify), $\rho=3$ and $n=23$ (the Golay code,
found by direct search) and $\rho=1$ and $n=2^{m}-1$.
There are known to be non-Hamming codes with $\rho=1$ and $n=2^{m}-1$,
it is suspected that there are many of them and
they are the subject of much research, but,
of course they present no practical advantages.
The only linear perfect codes with $\rho=1$ and $n=2^{m}-1$
are the Hamming codes.}.
Here are a number of simple tricks for creating new
codes from old.
\begin{definition} If $C$ is a code of
length $n$, the \emph{parity check extension}
$C^{+}$ of $C$ is the code of length $n+1$ given by
\[C^{+}=\left\{{\mathbf x}\in{\mathbb F}_{2}^{n+1}:
(x_{1},x_{2},\dots,x_{n})\in C,\ \sum_{j=1}^{n+1}x_{j}=0
\right\}.\]
\end{definition}
\begin{definition} If $C$ is a code of
length $n$, the \emph{truncation}
$C^{-}$ of $C$ is the code of length $n-1$ given by
\[C^{-}=\{(x_{1},x_{2},\dots,x_{n-1}):
(x_{1},x_{2},\dots,x_{n})\in C
\ \text{for some $x_{n}\in{\mathbb F}_{2}$}\}.\]
\end{definition}
\begin{definition} If $C$ is a code of
length $n$, the \emph{shortening} (or \emph{puncturing})
$C'$ of $C$ by the symbol $\alpha$ (which may be $0$
or $1$)
is the code of length $n-1$ given by
\[C'=\{(x_{1},x_{2},\dots,x_{n-1}):
(x_{1},x_{2},\dots,x_{n-1},\alpha)\in C\}.\]
\end{definition}
\begin{lemma} If $C$ is linear, so is its
parity check extension $C^{+}$, its
truncation $C^{-}$ and its shortening $C'$ (provided
that the symbol chosen is $0$).
\end{lemma}
How can we combine two linear codes $C_{1}$ and $C_{2}$?
Our first thought might be to look at their
direct sum
\[C_{1}\oplus C_{2}=\{({\mathbf x}|{\mathbf y})
:{\mathbf x}\in C_{1},\ {\mathbf y}\in C_{2}\},\]
but this is unlikely to be satisfactory.
\begin{lemma} If $C_{1}$ and $C_{2}$ are linear codes,
then we have the following relation between
minimum distances.
\[d(C_{1}\oplus C_{2})=\min\big(d(C_{1}),d(C_{2})\big).\]
\end{lemma}
On the other hand, if $C_{1}$ and $C_{2}$ satisfy
rather particular conditions, we can obtain
a more promising construction.
\begin{definition} Suppose $C_{1}$ and $C_{2}$
are linear codes of length $n$
with $C_{1}\supseteq C_{2}$
(i.e. with $C_{2}$ a subspace of $C_{1}$). We define
the \emph{bar product} $C_{1}|C_{2}$ of $C_{1}$
and $C_{2}$ to be the code of length $2n$ given by
\[C_{1}|C_{2}=\{({\mathbf x}|{\mathbf x}+{\mathbf y})
:{\mathbf x}\in C_{1},\ {\mathbf y}\in C_{2}\}.\]
\end{definition}
\begin{lemma} Let $C_{1}$ and $C_{2}$
be linear codes of length $n$
with $C_{1}\supseteq C_{2}$. Then
the bar product $C_{1}|C_{2}$
is a linear code with
\[\rank C_{1}|C_{2}=\rank C_{1}+\rank C_{2}.\]
The minimum distance of $C_{1}|C_{2}$ satisfies
the equality
\[d(C_{1}|C_{2})=\min(2d(C_{1}),d(C_{2})).\]
\end{lemma}
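As a sanity check on the rank and distance formulas, here is a minimal Python sketch of the bar product on an illustrative pair of codes (the function names are mine, not part of the course):

```python
def min_distance(code):
    """Minimum Hamming weight of a nonzero word (= minimum distance for a linear code)."""
    return min(sum(w) for w in code if any(w))

def bar_product(C1, C2):
    """Bar product C1|C2 = {(x | x + y) : x in C1, y in C2} over F_2.
    Assumes C2 is a subcode of C1."""
    return {tuple(x) + tuple((a + b) % 2 for a, b in zip(x, y))
            for x in C1 for y in C2}

# Illustrative toy: C1 = all of F_2^2 (rank 2, d = 1), C2 = {00, 11} (rank 1, d = 2).
C1 = {(0, 0), (0, 1), (1, 0), (1, 1)}
C2 = {(0, 0), (1, 1)}
C = bar_product(C1, C2)
assert len(C) == 2 ** (2 + 1)            # ranks add: rank C1 + rank C2 = 3
assert min_distance(C) == min(2 * 1, 2)  # d(C1|C2) = min(2 d(C1), d(C2)) = 2
```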
We now return to the construction of specific codes.
Recall that the Hamming codes are suitable for situations
when the error rate $p$ is very small and we want
a high information rate. The Reed-Muller codes are suitable
when the error rate is very high and we are prepared
to sacrifice information rate. They were used by NASA
for the radio transmissions from its planetary probes
(a task which has been compared
to signalling across
the Atlantic with a child's torch\footnote{Strictly
speaking, the comparison is meaningless. However,
it sounds impressive and that is the main thing.}).
We start by considering the $2^{d}$ points $P_{0}$, $P_{1}$,
\dots, $P_{2^{d}-1}$ of the space
$X={\mathbb F}_{2}^{d}$. Our code
words will be of length $n=2^{d}$ and will
correspond to the indicator functions
${\mathbb I}_{A}$ on $X$. More specifically,
the possible code word ${\mathbf c}^{A}$ is given
by
\begin{alignat*}{2}
c_{i}^{A}&=1&&\qquad\text{if $P_{i}\in A$}\\
c_{i}^{A}&=0&&\qquad\text{otherwise}
\end{alignat*}
for some $A\subseteq X$.
In addition to
the usual vector space structure on ${\mathbb F}_{2}^{n}$,
we define a new operation
\[{\mathbf c}^{A}\wedge{\mathbf c}^{B}
={\mathbf c}^{A\cap B}.\]
Thus, if ${\mathbf x},{\mathbf y}\in {\mathbb F}_{2}^{n}$,
\[(x_{0},x_{1},\dots,x_{n-1})\wedge(y_{0},y_{1},\dots,y_{n-1})
=(x_{0}y_{0},x_{1}y_{1},\dots,x_{n-1}y_{n-1}).\]
Finally we consider the collection of $d$ hyperplanes
\[\pi_{j}=\{{\mathbf p}\in X:p_{j}=0\}\ \ [1\leq j\leq d]\]
in $X={\mathbb F}_{2}^{d}$
and the corresponding indicator functions
\[{\mathbf h}^{j}={\mathbf c}^{\pi_{j}},\]
together with the special vector
\[{\mathbf h}^{0}={\mathbf c}^{X}=(1,1,\dots,1).\]
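The vectors ${\mathbf h}^{j}$ and the wedge operation are easy to experiment with. The following Python sketch fixes an (arbitrary) ordering of the points of $X$, which is a convention of the sketch rather than of the course, and checks the basic identities:

```python
from itertools import product

d = 3
n = 2 ** d  # code length
# The points P_0, ..., P_{n-1} of X = F_2^d, in an arbitrary fixed order.
points = list(product([0, 1], repeat=d))
assert len(points) == n

# h^0 = indicator of all of X; h^j = indicator of the hyperplane {p : p_j = 0}.
h0 = tuple(1 for _ in points)
h = [tuple(1 if p[j] == 0 else 0 for p in points) for j in range(d)]

def wedge(x, y):
    """Coordinatewise product: c^A ^ c^B = c^{A intersect B}."""
    return tuple(a * b for a, b in zip(x, y))

# h^0 ^ x = x, and each hyperplane contains half the points of X.
assert all(wedge(h0, hj) == hj for hj in h)
assert all(sum(hj) == 2 ** (d - 1) for hj in h)
# h^1 ^ h^2 is the indicator of a codimension-2 flat, so has weight 2^{d-2}.
assert sum(wedge(h[0], h[1])) == 2 ** (d - 2)
```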
\begin{exercise} Suppose that
${\mathbf x},{\mathbf y},{\mathbf z}\in {\mathbb F}_{2}^{n}$
and $A,B\subseteq X$.
(i) Show that
${\mathbf x}\wedge{\mathbf y}={\mathbf y}\wedge{\mathbf x}$.
(ii) Show that
$({\mathbf x}+{\mathbf y})\wedge{\mathbf z}
={\mathbf x}\wedge{\mathbf z}+{\mathbf y}\wedge{\mathbf z}$.
(iii) Show that ${\mathbf h}^{0}\wedge{\mathbf x}={\mathbf x}$.
(iv) If ${\mathbf c}^{A}+{\mathbf c}^{B}={\mathbf c}^{E}$,
find $E$ in terms of $A$ and $B$.
(v) If ${\mathbf h}^{0}+{\mathbf c}^{A}={\mathbf c}^{E}$,
find $E$ in terms of $A$.
\end{exercise}
We refer to ${\mathcal A}_{0}=\{{\mathbf h}^{0}\}$ as the
set of terms of order zero.
If ${\mathcal A}_{k}$ is the set of terms of order at most $k$,
then the set ${\mathcal A}_{k+1}$ of terms of order at most $k+1$
is defined by
\[{\mathcal A}_{k+1}=\{{\mathbf a}\wedge{\mathbf h}^{j}:
{\mathbf a}\in {\mathcal A}_{k},\ 0\leq j\leq d\}.\]
Less formally, but more clearly, the elements
of order 1 are the ${\mathbf h}^{i}$, the elements
of order 2 are the ${\mathbf h}^{i}\wedge{\mathbf h}^{j} $
with $i<j$, and so on.

Suppose now that a secret integer $S$ with $0\leq S\leq N$
is to be shared among the $n$ members of the Faculty Board
in such a way that any $k$ of them acting together can
recover it. The sharer chooses a prime $p>N,\,n$.
She then chooses integers
$a_{1}$, $a_{2}$, \dots, $a_{k-1}$ at random
and distinct integers
$x_{1}$, $x_{2}$, \dots, $x_{n}$ at random
subject to $0\leq a_{j}\leq p-1$,
$1\leq x_{j}\leq p-1$, sets $a_{0}=S$
and computes
\[P(r)\equiv a_{0}+a_{1}x_{r}+a_{2}x_{r}^{2}+\dots+a_{k-1}x_{r}^{k-1}\mod{p}\]
choosing $0\leq P(r)\leq p-1$.
She then gives the $r$th member of the Faculty Board
the pair of numbers $\big(x_{r},P(r)\big)$ (the shadow pair,
to be kept secret
from everybody else) and tells everybody the value of $p$.
She then burns her calculations.
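The sharing and recovery steps can be sketched in a few lines of Python (an illustrative toy with a small prime; the function names are mine):

```python
import random

def make_shares(secret, k, n, p):
    """Polynomial secret sharing mod a prime p: any k of the n shadow pairs
    recover the secret. Assumes 0 <= secret < p and n < p."""
    coeffs = [secret] + [random.randrange(p) for _ in range(k - 1)]
    xs = random.sample(range(1, p), n)  # distinct nonzero x_r
    return [(x, sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p)
            for x in xs]

def recover(shares, p):
    """Lagrange interpolation at 0 recovers a_0 = secret from k shadow pairs."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % p
                den = den * (xi - xj) % p
        secret = (secret + yi * num * pow(den, p - 2, p)) % p
    return secret

p = 101  # prime exceeding both the secret and n
shares = make_shares(42, k=3, n=5, p=p)
assert recover(shares[:3], p) == 42   # any 3 of the 5 shares suffice
assert recover(shares[1:4], p) == 42
```

The inverse modulo $p$ is computed as $d^{p-2}$ by Fermat's little theorem; the Vandermonde argument below is what guarantees the interpolation is well defined.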
Suppose that $k$ members of the Faculty Board
with shadow pairs
$\big(y_{j},Q(j)\big)=\big(x_{r_{j}},P(r_{j})\big)$
$[1\leq j\leq k]$ are together.
By the properties of the Van der Monde determinant
(see~Lemma~\ref{L;van der Monde})
\begin{align*}
\begin{vmatrix}
1&y_{1}&y_{1}^{2}&\hdots&y_{1}^{k-1}\\
1&y_{2}&y_{2}^{2}&\hdots&y_{2}^{k-1}\\
1&y_{3}&y_{3}^{2}&\hdots&y_{3}^{k-1}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
1&y_{k}&y_{k}^{2}&\hdots&y_{k}^{k-1}
\end{vmatrix}
&\equiv
\begin{vmatrix}
1&1&1&\hdots&1\\
y_{1}&y_{2}&y_{3}&\hdots&y_{k}\\
y_{1}^{2}&y_{2}^{2}&y_{3}^{2}&\hdots&y_{k}^{2}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
y_{1}^{k-1}&y_{2}^{k-1}&y_{3}^{k-1}&\hdots&y_{k}^{k-1}
\end{vmatrix}\\
&\equiv\prod_{1\leq j<i\leq k}(y_{i}-y_{j})
\not\equiv 0\mod{p},
\end{align*}
so the $k$ shadow pairs determine the coefficients
$a_{0},a_{1},\dots,a_{k-1}$, and with them the secret
$S=a_{0}$, uniquely modulo $p$.

Thus $\bigstar$ is unlikely to have a unique solution
if $|\mathcal{K}||\mathcal{P}|$ is substantially
greater than $2^{n}$, and is
likely to have one if $2^{n}$ is substantially
greater than $|\mathcal{K}||\mathcal{P}|$.
(Here, as usual, $|\mathcal{B}|$
denotes the number of elements of $\mathcal{B}$.)
Now recall the definition of the information rate
given in Definition~\ref{information rate}.
If the message set ${\mathcal M}$
has information rate $\mu$
and the key set (that is the
shared secret set) $\mathcal{K}$
has information rate $\kappa$, then, taking logarithms,
we see that, if
\[n-m\kappa-n\mu\]
is substantially greater than $0$, then $\bigstar$
is likely to have a unique solution, but, if it is
substantially smaller, this is unlikely.
\begin{example} Suppose that,
instead of using binary
code, we consider an alphabet of 27 letters
(the English alphabet plus a space). We must
take logarithms to the base 27, but the considerations
above continue to apply. The English language
treated in this way has information rate
about .4. (This is very much a ball park figure.
The information rate is certainly less than .5
and almost certainly greater than .2.)
(i) In the Caesar code, we replace the $i$th element
of our alphabet by the $i+j$th (modulo 27). The
shared secret is a single letter (the code for $A$ say).
We have $m=1$, $\kappa=1$ and $\mu\approx .4$. Thus
\[n-m\kappa-n\mu\approx .6n-1.\]
If $n=1$ (so $n-m\kappa-n\mu\approx-.4$) it is obviously
impossible to decode the message. If $n=10$
(so $n-m\kappa-n\mu\approx 5$) a simple search through the
27 possibilities will almost always give a single
possible decode.
(ii) In a simple substitution code, a permutation of the
alphabet is chosen and applied to each letter of the
code in turn. The shared secret is a sequence of
26 letters (given the coding of the first 26 letters,
the 27th can then be deduced).
We have $m=26$, $\kappa=1$ and $\mu\approx .4$. Thus
\[n-m\kappa-n\mu\approx .6n-26.\]
In \emph{The Dancing Men}, Sherlock Holmes
solves such a code with $n=68$
(so $n-m\kappa-n\mu\approx 15$) without straining
the reader's credulity too much and I
would think that, unless the message is
very carefully chosen, most of my audience
could solve such a code with $n=200$
(so $n-m\kappa-n\mu\approx 100$).
(iii) In the one-time pad $m=n$ and $\kappa=1$,
so (if $\mu>0$)
\[n-m\kappa-n\mu=-n\mu\rightarrow-\infty\]
as $n\rightarrow\infty$.
(iv) Note that the larger $\mu$ is, the slower
$n-m\kappa-n\mu$ increases. This corresponds
to the very general statement that the
higher the information rate of the messages,
the harder it is to break the code in which they
are sent.
\end{example}
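The break-even message length, where $n-m\kappa-n\mu=0$, is $n=m\kappa/(1-\mu)$. A quick Python check of the figures quoted in the example, taking $\mu=0.4$ as in the text:

```python
def breakeven(m, kappa, mu):
    """Message length n at which n - m*kappa - n*mu = 0."""
    return m * kappa / (1 - mu)

mu = 0.4
# Caesar code: unique decoding plausible after fewer than 2 letters.
assert breakeven(1, 1, mu) < 2
# Simple substitution: break-even at about 43 letters ...
assert round(breakeven(26, 1, mu), 1) == 43.3
# ... so Holmes's n = 68 clears it comfortably: n - 26 - 0.4n is about 15.
assert round(68 - 26 - mu * 68) == 15
```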
The ideas just introduced can be formalised
by the notion of unicity distance.
\begin{definition}\label{unicity}
The \emph{unicity distance}
of a code is the number of bits of message
required to exceed the number of bits of
information in the key plus the number
of bits of information in the message.
\end{definition}
(The notion of information content
brings us back to Shannon whose paper
\emph{Communication theory of secrecy systems}\footnote{Available
on the web and in his \emph{Collected Papers}.},
published in 1949, forms the first modern treatment
of cryptography in the open literature.)
If we only use our code once to send a message
which is substantially shorter than the unicity
distance, we can be confident that no code breaker,
however gifted, could break it, simply because
there is no unambiguous decode.
(A one-time pad has unicity distance infinity.)
However, the fact that there is a unique solution
to a problem does not mean that it is easy
to find.
We have excellent reasons, some of which
are spelled out in the next section, to believe that
there exist codes for which the unicity distance
is essentially irrelevant to the maximum safe length
of a message. For these codes, even though there may
be a unique solution, the amount of work required
to find the solutions makes (it is hoped) any
attempt impractical.
\section{Asymmetric systems}\label{S;symmetric} Towards the end
of the previous section, we discussed a general
coding scheme depending on a shared secret
key ${\mathbf k}$ known to the encoder
and the decoder. The scheme can be generalised
still further by splitting the secret in two.
Consider a system
with two functions
$F:{\mathbb F}_{2}^{m}\times{\mathbb F}_{2}^{n}
\rightarrow{\mathbb F}_{2}^{q}$ and
$G:{\mathbb F}_{2}^{p}\times{\mathbb F}_{2}^{q}
\rightarrow{\mathbb F}_{2}^{n}$
such that
\[G({\mathbf l},F({\mathbf k},{\mathbf p}))={\mathbf p}.\]
Here $({\mathbf k},{\mathbf l})$ will
be a pair of secrets, ${\mathbf p}$ the message and
${\mathbf z}=F({\mathbf k},{\mathbf p})$ the
encoded message which can be decoded by
using the fact that $G({\mathbf l},{\mathbf z})
={\mathbf p}$. In this scheme, the encoder
must know ${\mathbf k}$, but need not know ${\mathbf l}$
and the decoder must know ${\mathbf l}$,
but need not know ${\mathbf k}$. Such a system
is called asymmetric.
So far the idea is interesting but not exciting.
Suppose, however, that we can show that
(i) knowing $F$, $G$ and $\mathbf k$,
it is very hard to find ${\mathbf l}$, and

(ii) if we do not know ${\mathbf l}$ then, even if we know $F$, $G$
and ${\mathbf k}$, it is very hard to find ${\mathbf p}$
from $F({\mathbf k},{\mathbf p})$.
\noindent
Then the code is secure at what we called level (3).
\begin{lemma}~\label{level 3}
Suppose that the conditions specified
above hold. Then an opponent who is entitled to
demand the encodings ${\mathbf z}_{i}$ of
any messages ${\mathbf p}_{i}$ they choose
to specify will still find it very hard to find
$\mathbf{p}$ when given
$F({\mathbf k},{\mathbf p})$.
\end{lemma}
Let us write $F({\mathbf k},{\mathbf p})={\mathbf p}^{K_{A}}$
and $G({\mathbf l},{\mathbf z})={\mathbf z}^{K_{A}^{-1}}$
and think of ${\mathbf p}^{K_{A}}$ as participant
$A$'s encipherment of $\mathbf{p}$ and
${\mathbf z}^{K_{A}^{-1}}$ as participant $B$'s
decipherment of ${\mathbf z}$. We then have
\[({\mathbf p}^{K_{A}})^{K_{A}^{-1}}={\mathbf p}.\]
Lemma~\ref{level 3} tells us that such a system
is secure however many messages are sent.
Moreover, if we think of $A$ as a spy-master,
he can broadcast $K_{A}$ to the world (that
is why such systems are called public key
systems) and invite anybody who
wants to spy for him to send him secret messages
in total confidence\footnote{Although we
make statements about certain codes along the lines
of `It does not matter who knows this', you
should remember the German naval saying
`All radio traffic is high treason'. If
any aspect of a code can be
kept secret, it should be kept secret.}.
It is all very well to describe such a code,
but do they exist? There is very strong evidence
that they do, but, so far, all mathematicians
have been able to do is to show that
provided certain mathematical problems which
are believed to be hard are indeed hard,
then good codes exist.
The following problem is believed to be hard.
\noindent
{\bf Problem} \emph{Given an integer $N$, which is
known to be the product $N=pq$ of two primes
$p$ and $q$, find $p$ and $q$.}
\noindent
Several schemes have been proposed based on the
assumption that this factorisation is hard.
(Note, however, that it is easy to find large `random'
primes $p$ and $q$.) We give a very elegant
scheme due to Rabin and Williams. It makes use
of some simple number theoretic
results from IA and IB.
The reader may well have seen the following results
before.
In any case, they are
easy to obtain by considering primitive
roots.
\begin{lemma} If $p$ is an odd prime the congruence
\[x^{2}\equiv d\mod{p}\]
is soluble if and only if $d\equiv 0$ or
$d^{(p-1)/2}\equiv 1$ modulo $p$.
\end{lemma}
\begin{lemma} Suppose $p$ is a prime such that
$p=4k-1$ for some integer $k$. Then, if the congruence
\[x^{2}\equiv d\mod{p}\]
has any solution, it has $d^{k}$ as a solution.
\end{lemma}
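A quick Python check of the last two lemmas for one small prime of the form $4k-1$ (the prime $23$ is an arbitrary choice). Here $k=(p+1)/4$, and Euler's criterion $d^{(p-1)/2}\equiv 1$ detects solubility:

```python
p = 23                  # prime of the form 4k - 1 (here k = 6)
k = (p + 1) // 4
for d in range(p):
    root = pow(d, k, p)  # the candidate square root d^k mod p
    if d == 0 or pow(d, (p - 1) // 2, p) == 1:
        # x^2 = d (mod p) is soluble, and d^k is indeed a solution
        assert root * root % p == d
    else:
        # x^2 = d (mod p) is insoluble, so no candidate can work
        assert root * root % p != d
```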
We now call on the Chinese remainder theorem.
\begin{lemma}\label{square root}
Let $p$ and $q$ be primes of the
form $4k-1$ and set $N=pq$.
Then the following two problems are of equivalent
difficulty.
(A) Given $N$ and $d$ find all the $m$ satisfying
\[m^{2}\equiv d\mod{N}.\]
(B) Given $N$ find $p$ and $q$.
\end{lemma}
\noindent
(Note that, provided that $d\not\equiv 0$,
knowing the solution to (A) for any $d$
gives us the four solutions for the case $d=1$.)
The result is also true but much harder
to prove for general primes $p$ and $q$.
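The (A)$\Rightarrow$(B) direction of Lemma~\ref{square root} rests on the observation that two essentially different square roots of the same $d$ reveal a factor of $N$. A Python sketch with illustrative small primes (finding the roots by brute force, which a real attacker would replace by the hypothesised oracle for problem (A)):

```python
from math import gcd

# If m1^2 = m2^2 = d (mod N) with m1 not congruent to +-m2, then N divides
# (m1 - m2)(m1 + m2) but neither factor, so gcd(m1 - m2, N) is p or q.
p, q = 19, 23            # both of the form 4k - 1
N = p * q
d = 4
roots = [m for m in range(N) if m * m % N == d]
assert len(roots) == 4   # four square roots of d modulo N, by the CRT
m1 = roots[0]
m2 = next(m for m in roots if m != m1 and m != N - m1)
g = gcd(m1 - m2, N)
assert g in (p, q)       # a nontrivial factor of N drops out
```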
At the risk of giving aid and comfort
to followers of the
Lakatosian heresy, it must be admitted that
the statement of Lemma~\ref{square root}
does not really tell us what the result
we are proving is,
although the proof makes it clear
that the result (whatever it may be) is certainly
true. However, with more work, everything can be
made precise.
We can now give the Rabin--Williams scheme.
The spy-master $A$ selects two very large
primes $p$ and $q$. (Since he has only done
an undergraduate course in mathematics,
he will take $p$ and $q$ of the form $4k-1$.)
He keeps the pair $(p,q)$ secret, but broadcasts
the public key $N=pq$. If $B$ wants to
send him a message, she writes it in binary code and
splits it into blocks of length $m$ with
$2^{m}<N$.

If Eve escapes detection then (with $q$ chosen in advance)
the probability that she will have intercepted more than
a proportion $q$ of the remaining `good' photons is less than $\epsilon/4$.
Although we shall not do this, the reader who has
ploughed through these notes will readily accept that
Bob and Alice can use the message conveyed through the
remaining good photons to construct
a common secret such that Eve has probability
less than $\epsilon/4$ of guessing
it.
Thus, unless they decide that their
messages are being partially read,
Alice and Bob can agree a shared secret
with probability less than $\epsilon$ that
an eavesdropper can guess it.
There are various gaps in the exposition above.
First we have assumed that Eve must hold her polariser
at a small fixed number of angles. A little thought
shows that allowing her a free choice of angle will make
little difference. Secondly, since physical systems
always have imperfections, some `good' photons
will produce errors even in the absence of
Eve. This means that $p$ in STEP~5 must
be chosen above the `natural noise level'
and the sequences must be longer but, again,
this ought to make little difference.
There is a further engineering problem that
it is very difficult just to send single photons
every time. If there are too many groups of photons,
then Eve need only capture one photon from a group and
let the rest go, so we cannot detect the eavesdropping.
If there are only a few,
then the values of $p$ and $q$ can be adjusted to
take account of this. There are several networks in existence
which employ quantum cryptography.
Quantum cryptography has definite advantages when
matched individually against RSA, secret sharing (using a large
number of independent channels) or one-time pads.
It is less easy to find applications where
it is better than the best choice of one of
these three `classical' methods\footnote{One problem
is indicated by the first British military action
in World War I which was to cut the undersea
telegraph cables linking Germany to the outside world.
Complex systems are easier to disrupt than simple ones.}.
Of course, quantum cryptography will appeal to
those who need to persuade others that they are using
the latest and most expensive technology to
guard their secrets. However as I said before
\emph{coding schemes
are, at best, cryptographic elements
of larger possible cryptographic systems.}
If smiling white-coated technicians install big
gleaming machines with `Unbreakable
Quantum Code Company' painted in large letters
above the keyboard in the homes of Alice and Bob,
it does not automatically follow that their
communications are safe. Money will buy
the appearance of security. Only thought will buy
the appropriate security for a given purpose
at an appropriate cost. And even then we cannot
be sure.
\begin{verse}
As we know,\\
There are known knowns.\\
There are things we know we know.\\
We also know\\
There are known unknowns.\\
That is to say\\
We know there are some things\\
We do not know.\\
But there are also unknown unknowns,\\
The ones we don't know\\
We don't know\footnote{Rumsfeld}.
\end{verse}
\section{Further reading} For many students
this will be one of the last university mathematics
courses they will take. Although the twin
subjects of error-correcting codes and
cryptography occupy a small place in the
grand panorama of modern mathematics,
it seems to me that they form a very suitable
topic for such a final course.
Outsiders often think of mathematicians as
guardians of abstruse but settled knowledge.
Even those who understand that there are
still unsettled problems ask what
mathematicians will do when they run out
of problems. At a more subtle
level, Kline's magnificent
\emph{Mathematical Thought from Ancient to
Modern Times}~\cite{Kline} is pervaded by
the melancholy thought
that, though the problems
will not run out, they may become more
and more baroque and inbred. `You are not the mathematicians
your
parents were' whispers Kline
`and your problems are not the
problems your parents' were.'
However, when we look at this course, we
see that the idea of error-correcting codes
did not exist before 1940. The best
designs of such codes depend on the kind
of `abstract algebra' that historians
like Kline and Bell consider a dead end,
and lie behind the superior
performance of CD players and
similar artifacts.
In order to go further into the study of codes, whether
secret or error correcting, we need to go
into the question of how the information
content of a message is to be measured.
`Information theory' has its roots in the
code breaking of World War II (though
technological needs would doubtless
have led to the same ideas shortly
thereafter anyway). Its development
required a level of sophistication
in treating probability which was simply
not available in the 19th century.
(Even the Markov chain is essentially 20th
century\footnote{We are now in the 21st century,
but I suspect that we are still part
of the mathematical `long 20th century'
which started in the 1880s with the work of Cantor
and like minded contemporaries.}.)
The question of what makes a calculation difficult
could not even have been thought about until
G\"{o}del's theorem (itself a product of
the great `foundations crisis' at the beginning
of the 20th century). Developments by
Turing and Church of G\"{o}del's theorem
gave us a theory of computational complexity
which is still under development today.
The question of whether there exist
`provably hard' public codes is intertwined
with still unanswered questions in complexity
theory. There are links with the profound
(and very 20th century) question of what
constitutes a random number.
Finally, the invention of the electronic computer
has produced a cultural change in the attitude
of mathematicians towards algorithms. Before 1950,
the construction of algorithms was a minor interest
of a few mathematicians. (Gauss and Jacobi were
considered unusual in the amount of thought they
gave to actual computation.) Today, we would consider
a mathematician
as much a maker of algorithms as a prover of theorems.
The notion of the \emph{probabilistic algorithm}
which hovered over much of our discussion of
secret codes is a typical invention of the last
decades of the 20th century.
Although both the subjects of error correcting and
secret codes are now `mature' in the
sense that they provide usable and well tested
tools for practical application, they still
contain deep unanswered questions. For example:
how close to the Shannon bound can a
`computationally easy' error-correcting code
get? Do provably hard public codes exist?
Even if these questions are too hard, there
must surely exist error correcting and public
codes based on new ideas\footnote{Just as quantum
cryptography was.}. Such ideas would
be most welcome
and, although they are most likely to come
from the professionals, they might come
from outside the usual charmed circles.
Those who wish to learn about error correction
from the horse's mouth will consult Hamming's
own book on the matter~\cite{Hamming}.
For the present course,
the best book I know for further reading
is Welsh~\cite{Welsh}. After this,
the book of Goldie and Pinch~\cite{Pinch} provides
a deeper idea of the meaning of information
and its connection with the topic. The book by
Koblitz~\cite{Koblitz} develops the number theoretic
background.
The economic and practical importance of
transmitting, storing and processing data
far outweighs the importance of hiding it.
However, hiding data is more romantic.
For budding cryptologists and cryptographers
(as well as those who want a good read),
Kahn's \emph{The Codebreakers}~\cite{Kahn Code}
has the same role as is taken by Bell's
\emph{Men of Mathematics}
for budding mathematicians.
I conclude with a quotation from Galbraith
(referring to his time as ambassador to India)
taken from Koblitz's entertaining text~\cite{Koblitz}.
\begin{quotation}
I had asked that a cable from Washington to New Delhi
\dots be reported to me through the Toronto consulate.
It arrived in code; no facilities existed for decoding.
They brought it to me at the airport --- a mass of
numbers. I asked if they assumed I could read it.
They said no. I asked how they managed.
They said that when something arrived in code,
they phoned Washington and had the original
read to them.
\end{quotation}
\begin{thebibliography}{99}
\bibitem{Eco} U.~Eco \emph{The Search for the Perfect Language}
(English translation), Blackwell, Oxford 1995.
\bibitem{Hamming} R.~W.~Hamming \emph{Coding and Information Theory}
(2nd edition) Prentice Hall, 1986.
\bibitem{Kahn Code} D.~Kahn \emph{The Codebreakers:
The Story of Secret Writing} MacMillan, New York, 1967.
(A lightly revised edition has recently appeared.)
\bibitem{Kahn Enigma} D.~Kahn
\emph{Seizing the Enigma} Houghton Mifflin, Boston, 1991.
\bibitem{Kline} M.~Kline
\emph{Mathematical Thought from Ancient to
Modern Times} OUP, 1972.
\bibitem{Koblitz} N.~Koblitz
\emph{A Course in Number Theory and Cryptography}
Springer, 1987.
\bibitem{Knuth} D.~E.~Knuth
\emph{The Art of Computer Programming}
Addison-Wesley. The third edition of
Volumes I to III is appearing during
this year and the next (1998--9).
\bibitem{Pinch} C.~M.~Goldie and R.~G.~E.~Pinch
\emph{Communication Theory}
CUP, 1991.
\bibitem{Thompson} T.~M.~Thompson
\emph{From Error-correcting Codes through Sphere Packings
to Simple Groups} Carus Mathematical Monographs {\bf 21},
MAA, Washington DC, 1983.
\bibitem{Welsh} D.~Welsh \emph{Codes and Cryptography}
OUP, 1988.
\end{thebibliography}
\newpage
There is a widespread superstition, believed both by supervisors
and supervisees, that exactly twelve questions are required
to provide full understanding of six hours of mathematics
and that the same twelve questions should be appropriate
for students of all abilities and all levels of diligence. I have tried
to keep this in mind, but have provided some extra questions
in the various exercise sheets
for those who scorn such old wives' tales.
\section{Exercise Sheet 1}
\begin{question}\label{C1.1}
(Exercises~\ref{E;Morse} and~\ref{E;ASCII}.) (i)
Consider Morse code.
\begin{align*}
A\mapsto \bullet-*\qquad
&&B\mapsto -\bullet\bullet\bullet*\qquad
&&C\mapsto-\bullet-\bullet*\\
D\mapsto -\bullet\bullet*\qquad
&&E\mapsto \bullet*\qquad
&&F\mapsto\bullet\bullet-\bullet*\\
O\mapsto ---*\qquad
&&S\mapsto\bullet\bullet\bullet*\qquad
&&7\mapsto--\bullet\bullet\bullet*
\end{align*}
Decode
$-\bullet-\bullet*---*-\bullet\bullet* \bullet*$.
(ii) Consider ASCII code.
\begin{align*}
A\mapsto 1000001\qquad
&&B\mapsto 1000010\qquad
&&C\mapsto 1000011\\
a\mapsto 1100001\qquad
&&b\mapsto 1100010 \qquad
&&c\mapsto 1100011\\
+\mapsto 0101011\qquad
&&!\mapsto 0100001\qquad
&&7\mapsto 0110111
\end{align*}
Encode $b7!$. Decode $110001111000011100010$.
\end{question}
\begin{question}\label{C1.2}
(Exercises~\ref{E;fix length},~\ref{E;decodable}
and~\ref{E;auto free}.)
Consider two alphabets ${\mathcal A}$
and ${\mathcal B}$
and a coding function $c:{\mathcal A}\rightarrow{\mathcal B}^{*}$.
(i) Explain, without using the notion of prefix-free
codes, why, if $c$ is injective
and fixed length, $c$
is decodable. Explain why, if $c$ is injective
and fixed length, $c$
is prefix-free.
(ii) Let ${\mathcal A}={\mathcal B}=\{0,1\}$. If
$c(0)=0$ and $c(1)=00$, show that $c$ is injective but $c^{*}$ is not.
(iii) Let ${\mathcal A}=\{1,2,3,4,5,6\}$ and ${\mathcal B}=\{0,1\}$.
Show that there is a variable length coding $c$ such that
$c$ is injective and all code words have length $2$ or less.
Show that there is no decodable coding $c$ such that
all code words have length $2$ or less
\end{question}
\begin{question}\label{C1.3} The product of two codes
$c_{j}:{\mathcal A}_{j}\rightarrow{\mathcal B}^{*}_{j}$
is the code
\[g:{\mathcal A}_{1}\times{\mathcal A}_{2}
\rightarrow({\mathcal B}_{1}\cup{\mathcal B}_{2})^{*}\]
given by
$g(a_{1},a_{2})=c_{1}(a_{1})c_{2}(a_{2})$.
Show that the product of two prefix-free codes is
prefix-free, but the product of a decodable code and a prefix-free
code need not even be decodable.
\end{question}
\begin{question}\label{C1.4}\label{E;Huffman 1 again}
(Exercises~\ref{E;Huffman 1} and~\ref{E;Huffman 3})
(i) Apply Huffman's algorithm to the nine messages $M_{j}$ where
$M_{j}$ has probability $j/45$
for $1\leq j\leq 9$.
(ii) Consider
$4$ messages with the following properties.
$M_{1}$ has probability $.23$,
$M_{2}$ has probability $.24$,
$M_{3}$ has probability $.26$
and $M_{4}$ has probability $.27$. Show that any
assignment of the code words $00$, $01$, $10$
and $11$ produces a best code in the sense of this course.
\end{question}
\begin{question}\label{C1.5}
(Exercises~\ref{E;Huffman 2} and~\ref{E;Memory}.)
(i) Consider $64$ messages $M_{j}$.
$M_{1}$ has probability $1/2$,
$M_{2}$ has probability $1/4$ and $M_{j}$ has probability
$1/248$ for $3\leq j\leq 64$.
Explain why, if we use code words of equal length,
then the length of a code word must be at least $6$.
By using the ideas of Huffman's algorithm (you should not
need to go through all the steps) obtain a set of
code words such that the \emph{expected} length of a code word
sent is no more than $3$.
(ii) Let $a,\,b>0$. Show that
\[\log_{a} b=\frac{\log b}{\log a}.\]
\end{question}
\begin{question}\label{C1.6} (Exercise~\ref{E;Fano})
(i) Let ${\mathcal A}=\{1,2,3,4\}$. Suppose that the
probability
that letter $k$ is chosen is $k/10$.
Use your calculator
to find $\lceil -\log_{2} p_{k}\rceil$
and write down a Shannon--Fano
code $c$.
(ii) We found a Huffman code $c_{h}$ for the system
in Example~\ref{E;Huffman do}.
Show
that the entropy is approximately $1.85$,
that ${\mathbb E}|c(A)|=2.4$
and that ${\mathbb E}|c_{h}(A)|=1.9$.
Check that these results are consistent with
the appropriate theorems of the course.
\end{question}
\begin{question}\label{C1.7}
(Exercise~\ref{E;Markov})
Suppose that we have a sequence
$X_{j}$ of random variables taking the values
$0$ and $1$. Suppose that $X_{1}=1$ with probability $1/2$
and
$X_{j+1}=X_{j}$ with probability
$.99$ independent of what has gone before.
(i) Suppose we wish to send $10$ successive bits
$X_{j}X_{j+1}\dots X_{j+9}$. Show that if we associate
the sequence of ten zeros with $0$, the sequence
of ten ones with $10$ and any other sequence
$a_{0}a_{1}\dots a_{9}$ with $11a_{0}a_{1}\dots a_{9}$,
we have a decodable code which on average requires
about $5/2$ bits to transmit the sequence.
(ii) Suppose we wish to send the bits
$X_{j}X_{j+10^{6}}X_{j+2\times 10^{6}}\dots X_{j+9\times 10^{6}}$.
Explain why any decodable code will require
on average at least $10$ bits to transmit the sequence.
(You need not do detailed computations.)
\end{question}
\begin{question}\label{C1.8}
In Bridge, a $52$ card pack is dealt
to provide $4$ hands of $13$ cards each.
(i) Purely as
a matter of interest, we consider the following question.
If the contents of a hand are conveyed by one player
to their partner by a series of nods and shakes of the head
how many movements of the head are required?
Show that at least $40$ movements are required.
Give a simple code requiring $52$ movements.
$[$You may assume for simplicity that the player to
whom the information is being communicated does not
look at her own cards. (In fact this does not make
a difference since the two players do not acquire
any shared information by looking at their
own cards.)$]$
(ii) If instead the player uses the initial letters
of words (say using the $16$ most common letters),
how many words will you
need to utter\footnote{`Marked cards, M. l'Anglais?' I said,
with a chilling sneer. 'They are used, I am told,
to trap players--not unbirched schoolboys.'
'Yet I say that they are marked!'
he replied hotly, in his queer foreign jargon.
'In my last hand I had nothing. You doubled the stakes.
Bah, sir, you knew! You have swindled me!'
'Monsieur is easy to swindle
-- when he plays with a mirror behind him,'
I answered tartly. \emph{Under the Red Robe} S.~J.~Weyman}?
\end{question}
\begin{question}\label{C1.9}
(i) In a \emph{comma code}, like
Morse code, one symbol from an alphabet of $m$ letters
is reserved to end each code word. Show that
this code is prefix-free and give a direct
argument to show that it must satisfy Kraft's inequality.
(ii) Give an example of a code satisfying Kraft's inequality which
is not decodable.
\end{question}
\begin{question}\label{C1.10}
Show that if an optimal binary code has
word lengths $s_{1}$, $s_{2}$, \dots $s_{m}$ then
\[m\log_{2} m\leq s_{1}+s_{2}+\dots+s_{m}\leq (m^{2}+m-2)/2.\]
\end{question}
\begin{question}\label{C1.11} (i) It is known that
exactly one member of the
starship Emphasise has contracted the Macguffin virus.
A test is available that will detect the virus at any dilution.
However, the power required is such that the ship's
force shields
must be switched off\footnote{`Captain, ye canna be serious.'}
for a minute during each test. Blood samples are taken
from all crew members. The ship's computer has worked
out that the probability of crew member number $i$
harbouring the virus is $p_{i}$. (Thus the probability that
the captain, who is, of course, number $1$, has the disease
is $p_{1}$.)
Explain how,
by testing pooled
samples, the expected number of tests can be minimised.
Write down the exact form of the test when there are $2^{n}$
crew members and $p_{i}=2^{-n}$.
(ii) Questions like~(i) are rather artificial,
since they require that exactly one person carries the virus.
Suppose that the probability that any member of a population
of $2^{n}$ has a certain disease is $p$ (and that
the probability is independent of the health of the others)
and there exists an error free test which can be carried out on pooled
blood samples which indicates the presence of the
disease in at least one of the samples or its absence from all.
Explain why there cannot be a testing scheme
which can be guaranteed to require less than $2^{n}$
tests to diagnose all members of
the population. How does the scheme suggested
in the last sentence of~(i) need to be modified to take
account of the fact that more than one person
may be ill (or, indeed, no one may be ill)?
Show that the expected number of tests required by
the modified scheme is no greater than
$pn2^{n+1}+1$. Explain why the cost of testing
a large population of size $x$ is no more than about
$2pcx\log_{2} x$ with $c$ the cost of a test.
(iii) In practice, pooling schemes will be less complicated.
Usually a group of $x$ people are tested jointly
and, if the joint test shows the disease, each is tested individually.
Explain why this is not sensible if $p$ is large
but is sensible (with a reasonable choice of $x$)
if $p$ is small.
If $p$ is small, explain why there is an optimum value for $x$.
Write down (but do not
attempt to solve)
an equation which indicates (in a `mathematical methods' sense)
that optimum value
in terms of $p$, the probability that an individual has the
disease.
Schemes like these are only worthwhile if the disease is rare and
the test is expensive but works on pooled samples.
However, these circumstances do occur together from
time to time and the idea then produces public health
benefits much more cheaply than would otherwise be possible.
\end{question}
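With $2^{n}$ equally likely carriers, the optimal scheme of~(i) reduces to repeated halving (the general case splits pools by probability mass). A Python sketch of that special case, not part of the exercise and with names of my own choosing:

```python
def pooled_tests(status):
    # repeated halving, assuming exactly one entry of `status` is True
    # (infected); returns (number of tests used, index of the carrier)
    tests, lo, hi = 0, 0, len(status)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1                      # one pooled test on the first half
        if any(status[lo:mid]):
            hi = mid
        else:
            lo = mid
    return tests, lo
```

With $2^{n}$ crew members this uses exactly $n$ tests, whichever member is infected.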
\begin{question}\label{C1.12}
(i) Give the appropriate generalisation
of Huffman's algorithm to an alphabet with $a$ symbols
when you have $m$ messages and $m\equiv 1\mod{a-1}$.
(ii) Prove that your algorithm gives an optimal
solution.
(iii) Extend the algorithm to cover general $m$ by
introducing messages of probability zero.
\end{question}
\begin{question}\label{C1.13}
(i) A set of $m$ apparently identical
coins consists of $m-1$ coins and one heavier coin.
You are given a balance in which you can weigh
equal numbers of the coins and determine which side
(if either) contains the heavier coin.
You wish to find the heavy coin using the smallest
expected number of weighings.
If $3^{r}+1\leq m\leq 3^{r+1}$, show that you can label
each coin with a ternary number $a_{1}a_{2}\ldots a_{r+1}$
with $a_{j}\in\{0,1,2\}$ in such a way that
the number of coins
having
$1$ in the $j$th place
equals the number of coins with $2$ in the $j$th place
for each $j$
(think Huffman ternary trees).
By considering the Huffman algorithm problem for
prefix-free codes on an alphabet with
three letters, solve the problem stated in the first
part and show that you do indeed
have a solution. Show that your
solution also minimises the maximum number
of weighings that you might have to do.
(ii) Suppose the problem is as before but
$m=12$ and the odd coin may be heavier or lighter.
Show that you need at least $3$ weighings.
[In fact you can always do it in $3$ weighings, but
the problem of showing this
`is said to have been planted
during the war \dots by enemy agents
since Operational Research spent so many man-hours
on its solution.'\footnote{The quotation comes
from Pedoe's \emph{The Gentle Art of Mathematics}
which also gives a very pretty solution.
As might be expected, there are many accounts of this problem
on the web.}]
\end{question}
\begin{question}\label{C1.14} Extend the definition of entropy to
a random variable $X$ taking values in the non-negative
integers. (You must allow for the possibility
of infinite entropy.)
Compute the expected value ${\mathbb E}Y$ and
entropy $H(Y)$ in the case when $Y$ has the geometric
distribution, that is to say
$\Pr(Y=k)=p^{k}(1-p)$
$[0<p<1]$. What happens when $p>1/2$? What
if $p=1/2$?
\end{question}
\begin{question}\label{C2.3} (Exercise~\ref{ISBN}.)
If you look
at the inner
title page of almost any book
published between 1974 and 2007,
you will find its International Standard
Book Number (ISBN). The ISBN
uses single digits selected from 0, 1, \dots, 8, 9
and $X$ representing 10. Each ISBN consists
of nine such digits $a_{1}$, $a_{2}$, \dots, $a_{9}$
followed by a single check digit $a_{10}$ chosen
so that
\begin{equation*}
10a_{1}+9a_{2}+ \dots+2a_{9}+a_{10}\equiv 0\mod{11}.\tag*{(*)}
\end{equation*}
(In more sophisticated language, our code $C$ consists
of those elements ${\mathbf a}\in {\mathbb F}_{11}^{10}$
such that $\sum_{j=1}^{10}(11-j)a_{j}=0$.)
(i) Find a couple of books
and check that $(*)$ holds for their ISBNs.
(ii) Show that $(*)$ will not work if you make a mistake
in writing down one digit of an ISBN.
(iii) Show that
$(*)$ may fail to detect two errors.
(iv) Show that $(*)$ will not work if you interchange
two distinct adjacent digits (a transposition error).
(v) Does~(iv) remain true if we replace `adjacent'
by `different'?
\noindent Errors of type (ii) and (iv) are the most common
in typing.
In communication between publishers and booksellers,
both sides are anxious that errors should be detected
but would prefer the other side to query errors
rather than to guess what the error might have been.
(vi) Since the ISBN contained information such as the name of the
publisher, only a small proportion of possible ISBNs could
be used\footnote{The same problem occurs with telephone numbers.
If we use the Continent, Country, Town, Subscriber system
we will need longer numbers than if we just numbered
each member of the human race.}
and the system described above
started to `run out of numbers'. A new system
was introduced which
is compatible with the system used to label most consumer goods.
After January 2007, the appropriate ISBN became a $13$ digit number
$x_{1}x_{2}\dots x_{13}$ with each digit
selected from $0$, $1$, \dots, $8$, $9$ and
the check digit $x_{13}$ computed by using the formula
\[x_{13}\equiv -(x_{1}+3x_{2}+x_{3}+3x_{4}+\cdots+x_{11}+ 3x_{12})
\mod{10}.\]
Show that we can detect single errors. Give an example
to show that we cannot detect all transpositions.
\end{question}
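Both check-digit rules are easy to experiment with by machine. A Python sketch (function names mine; the rules themselves are exactly $(*)$ and the ISBN-13 formula above):

```python
def isbn10_ok(digits):
    # rule (*): sum over j = 1..10 of (11 - j) * a_j must vanish mod 11;
    # `digits` is the full 10-digit list, with 10 standing for 'X'
    return sum((10 - i) * d for i, d in enumerate(digits)) % 11 == 0

def isbn13_check(d12):
    # the 13-digit rule: x13 = -(x1 + 3 x2 + x3 + 3 x4 + ...) mod 10
    return -sum(d * (3 if i % 2 else 1) for i, d in enumerate(d12)) % 10
```

Rule $(*)$ catches every adjacent transposition because the weights differ by $1$ and $11$ is prime; the ISBN-13 rule misses transpositions of adjacent digits differing by $5$, since swapping them changes the weighted sum by a multiple of $10$.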
\begin{question}\label{C2.4}
(Exercise~\ref{bacon}.)
Suppose we use eight hole tape with
the standard paper tape code
and the probability that an error occurs at a particular
place on the tape (i.e. a hole occurs where it should
not or fails to occur where it should) is $10^{-4}$.
A program requires about 10\,000 lines of tape
(each line containing eight places)
using the paper tape code. Using
the Poisson approximation, direct calculation
(possible with a hand calculator but really no
advance on the Poisson method), or otherwise,
show that the probability that the tape
will be accepted as error free by the decoder
is less than 0.04\%.
Suppose now that we use the Hamming scheme
(making no use of the last place in each line).
Explain why the program requires about
17\,500 lines of tape but that any
particular line will be correctly decoded
with probability about $1-(21\times 10^{-8})$
and the probability that the entire program
will be correctly decoded is better than
99.6\%.
\end{question}
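The two probabilities quoted can be checked numerically; this Python fragment follows the estimates above directly:

```python
from math import comb

p = 1e-4                                  # error probability per place
plain_ok = (1 - p) ** (10_000 * 8)        # 10 000 lines of 8 places, all correct
# Hamming scheme: a 7-place line decodes correctly iff it has at most 1 error
line_ok = sum(comb(7, k) * p**k * (1 - p) ** (7 - k) for k in (0, 1))
hamming_ok = line_ok ** 17_500
```

The probability that a Hamming line fails is $1-\text{line\_ok}\approx 21\times 10^{-8}$, in agreement with the figure in the question.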
\begin{question}\label{C2.5}
If $0<\delta<1/2$,
find an $A(\delta)>0$ such that, whenever
$0\leq r\leq n\delta$, we have
\[\sum_{j=0}^{r}\binom{n}{j}\leq A(\delta)\binom{n}{r}.\]
(We use weaker estimates in the course but this is
the most illuminating. The particular value of $A(\delta)$
is unimportant so do not waste time trying to find
a `good' value.)
\end{question}
\begin{question}\label{C2.6}
Show that the $n$-fold repetition code
is perfect if and only if $n$ is odd.
\end{question}
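A quick check of the sphere-packing count behind this (not a substitute for the proof): the code $\{00\dots 0,\,11\dots 1\}$ corrects $e=\lfloor(n-1)/2\rfloor$ errors, and is perfect exactly when the two Hamming balls of radius $e$ exhaust ${\mathbb F}_{2}^{n}$.

```python
from math import comb

def repetition_perfect(n):
    # the n-fold repetition code corrects e errors; it is perfect iff
    # the two disjoint balls of radius e contain all 2^n words
    e = (n - 1) // 2
    return 2 * sum(comb(n, k) for k in range(e + 1)) == 2 ** n
```

For odd $n$ the sum is $2^{n-1}$ and the count is exact; for even $n$ it falls short.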
\begin{question}\label{C2.7}
(i) What is the expected Hamming distance
between two randomly chosen code words in ${\mathbb F}_{2}^{n}$?
(As usual we suppose implicitly that the two choices are independent
and all choices are equiprobable.)
(ii) Three code words are chosen at random
from ${\mathbb F}_{2}^{n}$. If $k_{n}$ is
the expected value of the distance between the
closest two,
show that $n^{-1}k_{n}\rightarrow 1/2$ as
$n\rightarrow\infty$.
\noindent[There are many ways to do~(ii). One way is to
consider Tchebychev's inequality.]
\end{question}
\begin{question}\label{C2.8}
(Exercises~\ref{E;fast Kelly}
and~\ref{E;Slow Kelly}.) Consider the situation
described in the first paragraph of Section~\ref{S;race track}.
(i) Show that for the situation described
you should not bet if $up\leq 1$ and should take
\[w=\frac{up-1}{u-1}\]
if $up>1$.
(ii) Let us write $q=1-p$. Show that, if $up>1$ and we choose
the optimum $w$,
\[{\mathbb E}\log Y_{n}=p\log p+q\log q+\log u-q\log(u-1).\]
(iii) Show that,
if you bet less than the optimal proportion, your fortune
will still tend to increase but more slowly, but, if you bet
more than some proportion $w_{1}$, your fortune will decrease.
Write down the equation for $w_{1}$.
[Moral: If you use the Kelly criterion veer on the side of under-betting.]
\end{question}
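A numerical check of (i) and (ii), with parameter values of my own choosing:

```python
from math import log

def growth(w, p, u):
    # expected log-growth per bet when proportion w is staked at payout u
    return p * log(1 - w + w * u) + (1 - p) * log(1 - w)

p, u = 0.6, 2.0
w_star = (u * p - 1) / (u - 1)     # the optimal proportion from part (i)
```

Comparing `growth` at $w^{*}$ and nearby values also illustrates (iii): the proportion $w_{1}$ is the nonzero root of the expected log-growth.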
\begin{question}\label{C2.9} Your employer announces that he is
abandoning the old-fashioned paternalistic scheme
under which he guarantees you a fixed sum $Kx$
(where, of course, $K,\,x>0$)
when you retire. Instead, he will empower you by
giving you a fixed sum $x$ now, to invest as you wish.
In order to help you
and the rest of the staff, your employer
arranges that you should obtain advice
from a financial whizkid with a top degree
from Cambridge. After a long lecture in which the whizkid
manages to be simultaneously condescending, boring
and incomprehensible, you come away with the following information.
When you retire, the world will be in exactly
one of $n$ states. By means of a piece of financial
wizardry called ditching (or something like that)
the whizkid can offer you a pension plan which
for the cost of $x_{i}$ will return $Kx_{i}q_{i}^{-1}$
if the world is in state $i$, but nothing otherwise.
(Here $q_{i}> 0$ and $\sum_{i=1}^{n}q_{i}=1$.)
The probability that the world will be in state $i$
is $p_{i}$. You must invest the entire fixed
sum. (Formally, $\sum_{i=1}^{n}x_{i}=x$. You must also take
$x_{i}\geq 0$.) On philosophical grounds you decide
to maximise the expected value $S$ of the logarithm
of the sum received on retirement.
Assuming that you will have to live
off this sum for the rest of your life,
explain, \emph{in your opinion}, why this choice is
reasonable or explain why it is unreasonable.
Find the appropriate choices of $x_{i}$.
Do they depend on the $q_{i}$?
Suppose that $K$ is fixed, but the whizkid can choose
$q_{i}$. We may suppose that what is good for you is bad for him
so
he will seek to minimise $S$ for your best choices.
Show that he will choose $q_{i}=p_{i}$.
Show that,
with these choices,
\[S=\log Kx.\]
\end{question}
\begin{question}\label{C2.10}
Let $C$ be the code consisting of the
word 10111000100 and its cyclic shifts (that is
01011100010, 00101110001 and so on) together with
the zero code word. Is $C$ linear? Show that $C$
has minimum distance 5.
\end{question}
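Exhaustive checking is feasible here; the following Python fragment builds the $12$ code words and computes the minimum distance. (For linearity, note at once that $12$ is not a power of $2$.)

```python
w = [1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0]
code = [w[i:] + w[:i] for i in range(11)] + [[0] * 11]   # 11 shifts + zero

def dist(a, b):
    return sum(x != y for x, y in zip(a, b))

dmin = min(dist(a, b) for i, a in enumerate(code) for b in code[i + 1:])
```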
\begin{question}\label{C2.11}
(i) The original Hamming code
was a $7$ bit code used in an $8$ bit system
(paper tape). Consider the code $c:\{0,1\}^{4}\rightarrow\{0,1\}^{8}$
obtained by using the Hamming code for the first $7$ bits
and the final bit as a check digit so that
\[x_{1}+x_{2}+\dots+x_{8}\equiv 0\mod{2}.\]
Find the minimum distance for this code. How many errors
can it detect? How many can it correct?
(ii) Given a code of length $n$ which corrects $e$ errors
can you always construct a code of length $n+1$
which detects $2e+1$ errors?
\end{question}
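One way to check part~(i) by machine: take a generator matrix for the $[7,4]$ Hamming code (the one below is a common choice; any equivalent code gives the same distance), append the overall parity bit, and enumerate all $16$ code words.

```python
from itertools import product

# a common generator matrix for the [7,4] Hamming code
G = [[1, 0, 0, 0, 0, 1, 1],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 1, 1, 0],
     [0, 0, 0, 1, 1, 1, 1]]

words = []
for m in product((0, 1), repeat=4):
    w = [sum(m[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]
    w.append(sum(w) % 2)        # the eighth place: overall parity check
    words.append(w)

# the code is linear, so the minimum distance is the minimum nonzero weight
dmin = min(sum(w) for w in words if any(w))
```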
\begin{question}\label{C2.12} In general,
we work under the assumption
that all messages sent through our noisy channel
are equally likely. In this question
we drop this assumption. Suppose that
each bit sent through a channel has probability
$1/3$ of being mistransmitted. There are $4$
codewords $1100$, $0110$, $0001$, $1111$
sent with probabilities $1/4$, $1/2$, $1/12$, $1/6$.
If you receive $1001$ what will you decode it as, using
each of the following rules?
(i) The ideal observer rule: find ${\mathbf b}\in C$
so as to maximise
\[\Pr({\mathbf b}\ \text{sent}\,|\,
{\mathbf u}\ \text{received}).\]
(ii) The maximum likelihood rule: find ${\mathbf b}\in C$
so as to maximise
\[\Pr({\mathbf u}\ \text{received}\,|\,
{\mathbf b}\ \text{sent}).\]
(iii) The minimum distance rule: find ${\mathbf b}\in C$
so as to minimise the Hamming distance
$d({\mathbf b},{\mathbf u})$ from the
received message ${\mathbf u}$.
\end{question}
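The three rules can be checked mechanically with exact arithmetic (it is worth doing the calculation by hand first):

```python
from fractions import Fraction as F

priors = {"1100": F(1, 4), "0110": F(1, 2), "0001": F(1, 12), "1111": F(1, 6)}
u, perr = "1001", F(1, 3)

def dist(a, b):
    return sum(x != y for x, y in zip(a, b))

def likelihood(b):
    # Pr(u received | b sent) for a memoryless channel with bit error perr
    d = dist(b, u)
    return perr ** d * (1 - perr) ** (4 - d)

ideal = max(priors, key=lambda b: priors[b] * likelihood(b))
ml = max(priors, key=likelihood)
nearest = min(priors, key=lambda b: dist(b, u))
```

The interest of the example is that the three rules do not all agree.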
\begin{question}\label{E;random ball}\label{C2.13}
(i) Show that $-t\geq \log(1-t)$ for $0\leq t<1$.
(ii) Show that, if $\delta_{N}>0$, $1-N\delta_{N}>0$ and
$N^{2}\delta_{N}\rightarrow\infty$, then
\[\prod_{m=1}^{N-1}(1-m\delta_{N})\rightarrow 0.\]
(iii) Let $V(n,r)$ be the number of points in a Hamming
ball of radius $r$ in ${\mathbb F}_{2}^{n}$ and
let $p(n,N,r)$ be the probability that $N$ such
balls chosen at random do not intersect.
By observing that if $m$ non-intersecting
balls are already placed, then an $(m+1)$st ball
which does not intersect them
must certainly not have its centre
in one of the balls already placed,
show that, if $N_{n}^{2}2^{-n}V(n,r_{n})\rightarrow\infty$,
then $p(n,N_{n},r_{n})\rightarrow 0$.
(iv) Show that,
if $2\beta+H(\alpha)>1$,
then $p(n,2^{\beta n},\alpha n)\rightarrow 0$.
Thus simply throwing balls down at random will not
give very good systems of balls with empty intersections.
\end{question}
\newpage
\section{Exercise Sheet 3}
\begin{question}\label{C3.1}
A message passes through a binary symmetric
channel with probability $p$ of error for each bit
and the resulting message is passed through a
second binary symmetric
channel which is identical except
that there is probability $q$ of error $[0

d+k-1$.
\end{question}
\begin{question}\label{C3.15} Implement the secret sharing method
of page~\pageref{P;secret sharing} with $k=2$, $n=3$, $x_{j}=j+1$,
$p=7$, $a_{0}=S=2$, $a_{1}=3$. Check directly that any two
people
can find $S$ but no single individual can.
If we take $k=3$, $n=4$,
$p=6$, $x_{j}=j+1$, show that
the first two members and the
fourth member of the Faculty Board will
be unable to determine $S$ uniquely.
Why does this not invalidate our method?
\end{question}
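The first part ($k=2$, $n=3$, $p=7$, $S=a_{0}=2$, $a_{1}=3$) can be checked directly; a Python sketch in which two-share recovery is Lagrange interpolation at $0$ over ${\mathbb F}_{7}$:

```python
p = 7
a0, a1 = 2, 3                            # the secret is S = a0
shares = [(x, (a0 + a1 * x) % p) for x in (2, 3, 4)]    # x_j = j + 1

def recover(two_shares):
    # f(0) by Lagrange interpolation through the two given points
    (x1, y1), (x2, y2) = two_shares
    inv = pow((x2 - x1) % p, p - 2, p)   # modular inverse via Fermat
    return (y1 * x2 - y2 * x1) * inv % p
```

A single share $(x_{j},y_{j})$ is consistent with every possible value of $S$, which is the point of the scheme.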
\newpage
\section{Exercise Sheet 4}
\begin{question} (Exercise~\ref{E;rational}.)\label{C4.1}
Show that the decimal expansion of
a rational number must be a recurrent expansion.
Give a bound for the period in terms of the denominator.
Conversely, by considering geometric series, or otherwise,
show that a recurrent decimal represents
a rational number.
\end{question}
\begin{question}\label{C4.2}
A binary non-linear feedback register of length 4
has defining relation
\[x_{n+1}=x_{n}x_{n-1}+x_{n-3}.\]
Show that the state space contains 4 cycles of lengths
1, 2, 4 and 9.
\end{question}
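The relation is invertible ($x_{n-3}=x_{n+1}+x_{n}x_{n-1}$), so the $16$ states split into disjoint cycles, which a short Python search enumerates:

```python
from itertools import product

def step(s):
    # state s = (x_{n-3}, x_{n-2}, x_{n-1}, x_n);
    # new bit x_{n+1} = x_n x_{n-1} + x_{n-3} (mod 2)
    a, b, c, d = s
    return (b, c, d, (d * c + a) % 2)

seen, lengths = set(), []
for s in product((0, 1), repeat=4):
    if s in seen:
        continue                 # this cycle was already traversed
    t, n = s, 0
    while True:
        seen.add(t)
        t = step(t)
        n += 1
        if t == s:
            break
    lengths.append(n)
```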
\begin{question}\label{C4.3}
A binary LFR was used to generate
the following stream
\[110001110001\dots\]
Recover the feedback polynomial by the Berlekamp--Massey
method. [The LFR has length $4$ but you should work through
the trials for length $r$ for $1\leq r\leq 4$.]
\end{question}
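Berlekamp--Massey is the point of the exercise, but with only $16$ possible tap patterns a brute-force cross-check of the final answer costs nothing:

```python
from itertools import product

stream = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1]

def fits(taps, s):
    # taps = (c1, ..., c4): does x_n = c1 x_{n-1} + ... + c4 x_{n-4} hold (mod 2)?
    r = len(taps)
    return all(s[n] == sum(c * s[n - i - 1] for i, c in enumerate(taps)) % 2
               for n in range(r, len(s)))

solutions = [t for t in product((0, 1), repeat=4) if fits(t, stream)]
```

The unique consistent recurrence of length $4$ is $x_{n}=x_{n-1}+x_{n-3}+x_{n-4}$.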
\begin{question}\label{C4.4}
(Exercise~\ref{Solve recurrence}.)
Consider the linear recurrence
\begin{equation*}
x_{n}=a_{0}x_{n-d}+a_{1}x_{n-d+1}+\ldots+a_{d-1}x_{n-1}\tag*{$\bigstar$}
\end{equation*}
with $a_{j}\in {\mathbb F}_{2}$ and $a_{0}\neq 0$.
(i) Suppose $K$ is a field containing ${\mathbb F}_{2}$
such that the auxiliary polynomial $C$ has a root $\alpha$
in $K$. Show that $x_{n}=\alpha^{n}$ is a solution of $\bigstar$ in $K$.
(ii) Suppose $K$ is a field containing ${\mathbb F}_{2}$
such that the auxiliary polynomial $C$ has
$d$ distinct roots $\alpha_{1}$, $\alpha_{2}$,
\dots, $\alpha_{d}$ in $K$. Show that the general solution
of $\bigstar$ in $K$ is
\[x_{n}=\sum_{j=1}^{d}b_{j}\alpha_{j}^{n}\]
for some $b_{j}\in K$.
If $x_{0},x_{1},\dots,x_{d-1}\in {\mathbb F}_{2}$,
show that $x_{n}\in {\mathbb F}_{2}$ for all $n$.
(iii) Work out the first few lines of Pascal's triangle
modulo 2. Show that the functions
$f_{j}:{\mathbb Z}\rightarrow{\mathbb F}_{2}$
\[f_{j}(n)=\binom{n}{j}\]
are linearly independent in the sense that
\[\sum_{j=0}^{m}b_{j}f_{j}(n)=0\]
for all $n$ implies $b_{j}=0$ for $0\leq j\leq m$.
(iv) Suppose $K$ is a field containing ${\mathbb F}_{2}$
such that the auxiliary polynomial $C$ factorises
completely into linear factors. If the
root $\alpha_{u}$ has multiplicity $m(u)$ $[1\leq u\leq q]$,
show that the general solution
of $\bigstar$ in $K$ is
\[x_{n}=\sum_{u=1}^{q}\sum_{v=0}^{m(u)-1}
b_{u,v}\binom{n}{v}\alpha_{u}^{n}\]
for some $b_{u,v}\in K$.
If $x_{0},x_{1},\dots,x_{d-1}\in {\mathbb F}_{2}$,
show that $x_{n}\in {\mathbb F}_{2}$ for all $n$.
\end{question}
\begin{question}\label{C4.5}
Consider the recurrence relation
\[u_{n+p}+\sum_{j=0}^{n-1}c_{j}u_{j+p}=0\]
over a field (if you wish, you may take the field
to be ${\mathbb R}$,
but the algebra is the same for all fields).
We suppose $c_{0}\neq 0$.
Write down an $n\times n$
matrix $M$ such that
\[\left(\begin{matrix}
u_{1}\\u_{2}\\ \vdots\\u_{n}
\end{matrix}\right)
=M\left(\begin{matrix}
u_{0}\\u_{1}\\ \vdots\\u_{n-1}
\end{matrix}\right).\]
Find the characteristic and minimal polynomials for $M$.
Would your answers be the same if $c_{0}=0$?
\end{question}
\begin{question}\label{C4.6}
(Exercise~\ref{Fish}.)
One of the most confidential
German codes (called FISH by the British)
involved a complex mechanism which
the British found could be simulated
by two loops of paper tape of
length $1501$ and $1497$. If $k_{n}=x_{n}+y_{n}$
where $x_{n}$ is a stream of period $1501$
and $y_{n}$ is a stream of period $1497$,
what is the longest possible period of $k_{n}$?
How many consecutive values of $k_{n}$ would you
need to find the underlying linear feedback register
using the Berlekamp--Massey method if you did
not have the information given in the question?
If you had
all the information given in the question
how many values of $k_{n}$ would you need?
(Hint: look at $x_{n+1497}-x_{n}$.)
You have shown that, given $k_{n}$ for sufficiently
many consecutive $n$, we can find $k_{n}$ for all $n$.
Can you find $x_{n}$ for all $n$?
\end{question}
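For the first part: the two tape lengths are coprime, so the period of $k_{n}$ can be as long as their product.

```python
from math import gcd

# the longest possible period of k_n is the lcm of the two stream periods
period = 1501 * 1497 // gcd(1501, 1497)
```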
\begin{question}\label{C4.7}
We work in ${\mathbb F}_{2}$.
I have a secret sequence $k_{1}$, $k_{2}$, \dots and a
message $p_{1}$, $p_{2}$, \dots, $p_{N}$. I transmit
$p_{1}+k_{1}$, $p_{2}+k_{2}$, \dots $p_{N}+k_{N}$
and then, by error, transmit
$p_{1}+k_{2}$, $p_{2}+k_{3}$, \dots $p_{N}+k_{N+1}$.
Assuming that you know this and that my
message makes sense, how would you go about
finding my message? Can you now decipher
other messages sent using the same
part of my secret sequence?
\end{question}
\begin{question}\label{C4.8}
Give an example of a homomorphism attack
on an RSA code. Show in reasonable
detail that the Elgamal
signature scheme defeats it.
\end{question}
\begin{question}\label{C4.9}
I announce that I shall
be using the Rabin--Williams scheme with
modulus $N$. My agent in X'Dofdro sends
me a message $m$ (with $1\leq m\leq N-1$)
encoded in the requisite form.
Unfortunately, my cat eats the piece of paper
on which the prime factors of $N$ are
recorded, so I am unable to decipher it.
I therefore find a new pair of primes
and announce that I shall
be using the Rabin--Williams scheme with
modulus $N'>N$. My agent now recodes
the message and sends it to me again.
The dreaded SNDO of X'Dofdro intercept
both code messages. Show that they
can find $m$. Can they decipher any
other messages sent to me using only
one of the coding schemes?
\end{question}
\begin{question}\label{C4.10}
Extend the Diffie--Hellman key exchange system
to cover three participants in a way that is
likely to be as secure as the two party scheme.
Extend the system to $n$ parties in such a way that
they can compute their common secret key by at
most $n^{2}-n$ communications of `Diffie--Hellman type numbers'.
(The numbers $p$
and $g$ of our original Diffie--Hellman system are
known by everybody in advance.) Show that this can be done using
at most $2n-2$ communications by including several
`Diffie--Hellman type numbers' in one message.
\end{question}
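A toy three-party version in Python (tiny $p$, $g$ and secrets of my own choosing, for illustration only): each quantity is passed once round the circle and each party finishes by raising what it finally receives to its own secret, giving the common key $g^{abc}$.

```python
p, g = 101, 2          # toy public parameters; real ones would be enormous
a, b, c = 3, 7, 11     # the three parties' secret exponents (toy values)

# pass each value once round the circle: 6 = n^2 - n messages for n = 3
gab = pow(pow(g, a, p), b, p)      # Alice -> Bob, Bob -> Carol
gbc = pow(pow(g, b, p), c, p)      # Bob -> Carol, Carol -> Alice
gca = pow(pow(g, c, p), a, p)      # Carol -> Alice, Alice -> Bob

key_a = pow(gbc, a, p)             # every party ends with g^{abc} mod p
key_b = pow(gca, b, p)
key_c = pow(gab, c, p)
```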
\begin{question}\label{E;abacus}\label{C4.11}
St Abacus, who established written Abacan,
was led, on theological grounds, to use an alphabet containing
only
three letters $A$, $B$ and $C$ and to avoid the use of
spaces. (Thus an Abacan book consists of a single word.)
In modern Abacan, the letter
$A$ has frequency $.5$ and the letters $B$ and $C$ both
have frequency $.25$. In order to disguise this,
the Abacan Navy uses codes in which the
$3r+i$th number is $x_{3r+i}+y_{i}$ modulo $3$ $[0\leq i\leq 2]$
where $x_{j}=0$ if the $j$th letter of the message is $A$,
$x_{j}=1$ if the $j$th letter of the message is $B$,
$x_{j}=2$ if the $j$th letter of the message is $C$
and $y_{0}$, $y_{1}$ and $y_{2}$ are the numbers $0$, $1$, $2$
in some order.
Radio interception has picked up the following message.
\[120022010211121001001021002021\]
Although nobody in Naval Intelligence reads Abacan,
it is believed that the last letter of the message will
be $B$ if the Abacan fleet is at sea.
The Admiralty are desperate to know the last letter
and send a representative to your rooms
in Baker Street to ask your advice. Give it.
\end{question}
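One way to automate the frequency attack: since $A$ (coded as $0$) has frequency $.5$, the commonest cipher digit in each residue class should come from $A$, and this determines the key. A Python sketch; its verdict should, of course, be checked against the message by hand.

```python
from collections import Counter

cipher = "120022010211121001001021002021"
digits = [int(ch) for ch in cipher]

# guess y_i as the commonest digit in residue class i mod 3
key = tuple(Counter(digits[i::3]).most_common(1)[0][0] for i in range(3))
plain = [(d - key[j % 3]) % 3 for j, d in enumerate(digits)]
last = "ABC"[plain[-1]]
```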
\begin{question}\label{Q;Alice cheats}\label{C4.12}
Consider the bit exchange scheme
proposed at the end of Section~\ref{S;symmetric}.
Suppose that we replace STEP~5 by: Alice sends Bob
$r_{1}$ and $r_{2}$ and Bob checks that
\[r_{1}^{2}\equiv r_{2}^{2}\equiv m\mod{n}.\]
Suppose further that Alice cheats by
choosing $3$ primes $p_{1}$, $p_{2}$, $p_{3}$,
and sending Bob $p=p_{1}$ and $q=p_{2}p_{3}$.
Explain how Alice can shift the odds of heads
to $3/4$. (She has other ways of cheating, but
you are only asked to consider this one.)
\end{question}
\begin{question}\label{C4.13}
(i) Consider the Fermat code
given by the following procedure.
`Choose $N$ a large prime. Choose $e$ and $d$ so that
$a^{de}\equiv a \mod{N}$, encrypt using the publicly known
$N$ and $e$, decrypt using the secret $d$.'
Why is this not a good code?
(ii) In textbook examples of the RSA code we frequently see
$e=65537$. How many multiplications
are needed to
compute $a^{e}$ modulo $N$?
(iii) Why is it unwise to choose primes $p$ and $q$
with $p-q$ small when forming $N=pq$ for the RSA method?
Factorise $1763$.
\end{question}
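For (ii) and (iii): square-and-multiply computes $a^{65537}$ in $17$ multiplications (since $65537=2^{16}+1$: sixteen squarings and one multiplication), and primes with $p-q$ small fall to Fermat's method. A Python sketch of both:

```python
from math import isqrt

def powmod_count(a, e, n):
    # left-to-right square-and-multiply, counting the multiplications mod n
    result, count = a % n, 0
    for bit in bin(e)[3:]:          # the bits of e after the leading 1
        result = result * result % n
        count += 1
        if bit == "1":
            result = result * a % n
            count += 1
    return result, count

def fermat_factor(n):
    # look for x with x^2 - n a perfect square y^2; then n = (x-y)(x+y);
    # fast precisely when the two factors are close together
    x = isqrt(n)
    if x * x < n:
        x += 1
    while True:
        y = isqrt(x * x - n)
        if y * y == x * x - n:
            return x - y, x + y
        x += 1
```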
\begin{question}\label{C4.14} The University of Camford is proud of
the excellence of its privacy system CAMSEC.
To advertise this
fact to the world, the Vice-Chancellor decrees that
the university telephone directory should bear on its cover
a number $N$ (a product of two very large secret primes)
and each name in the University Directory should
be followed by their personal encryption number $e_{i}$.
The Vice-Chancellor knows all the secret decryption numbers
$d_{i}$ but gives these out on a need to know basis only.
(Of course each member of staff must know their
personal decryption number but they are instructed to keep it secret.)
Messages $a$ from the Vice-Chancellor
to members of staff
are encrypted in the standard manner
as $a^{e_{i}}$ modulo $N$ and decrypted
as $b^{d_{i}}$ modulo $N$.
(i) The Vice-Chancellor sends a message to
all members of the University.
An outsider intercepts the encrypted message to
individuals $i$ and $j$ where $e_{i}$ and $e_{j}$
are coprime. How can the outsider read the message?
Can she read other messages sent from the Vice-Chancellor
to the $i$th member of staff only?
(ii) By means of a phone tapping device,
the Professor of Applied Numismatics
(number $u$ in the University Directory)
has intercepted messages from the
Vice-Chancellor to her
hated rival, the
Professor of Pure Numismatics
(number $v$ in the University Directory).
Explain why she can decode them.
What moral should be drawn?
\end{question}
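Part~(i) is the classical common-modulus attack: Euclid's algorithm gives $r$, $s$ with $re_{i}+se_{j}=1$, and then $b_{i}^{r}b_{j}^{s}\equiv a \mod{N}$. A toy Python sketch (tiny $N$ and exponents for illustration; `pow(c, -1, N)` needs Python 3.8 or later):

```python
def egcd(a, b):
    # extended Euclid: returns (g, x, y) with a*x + b*y == g == gcd(a, b)
    if b == 0:
        return a, 1, 0
    g, x, y = egcd(b, a % b)
    return g, y, x - (a // b) * y

def common_modulus(c1, c2, e1, e2, N):
    # recover a from a^e1 and a^e2 mod N, assuming gcd(e1, e2) == 1
    # and a invertible mod N
    g, r, s = egcd(e1, e2)               # r*e1 + s*e2 == 1
    def powsigned(c, k):
        return pow(pow(c, -1, N), -k, N) if k < 0 else pow(c, k, N)
    return powsigned(c1, r) * powsigned(c2, s) % N

N, e1, e2, m = 187, 3, 7, 9              # toy values for illustration
c1, c2 = pow(m, e1, N), pow(m, e2, N)
```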
\begin{question}\label{C4.15}
The Poldovian Embassy uses a one-time pad
to communicate with the notorious international spy Ivanovich Smith.
The messages are coded in the obvious way. (If the pad has $C$ the
$3$rd letter of the alphabet and the message has $I$ the $9$th
then the encrypted message has $L$ the $3+9$th. Work modulo $26$.)
Unknown to them, the person whom they employ
to carry the messages is actually the MI5 agent
`Union' Jack Caruthers in disguise.
MI5 are on the verge of arresting Ivanovich
when `Union' Jack is given the message
\[LRPFOJQLCUD.\]
Caruthers knows that the actual message is
\[FLYXATXONCE\]
and suggests that `the boffins change things a little'
so that Ivanovich deciphers the message as
\[REMAINXHERE.\]
The only boffin available is you. Advise MI5.
\end{question}
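The advice to give: recover the relevant stretch of pad from the known plaintext and ciphertext, then re-encrypt the substitute message with it. In Python, numbering the letters $A=0,\dots,Z=25$:

```python
def nums(s):
    return [ord(ch) - ord("A") for ch in s]

def text(v):
    return "".join(chr(n + ord("A")) for n in v)

cipher, plain, target = "LRPFOJQLCUD", "FLYXATXONCE", "REMAINXHERE"
key = [(c - m) % 26 for c, m in zip(nums(cipher), nums(plain))]   # the pad
new_cipher = text([(k + t) % 26 for k, t in zip(key, nums(target))])
```

Ivanovich, subtracting the same pad, now reads the substituted message.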
\begin{question}\label{E;add}\label{C4.16}
Suppose that $X$ and $Y$ are independent random variables
taking values in ${\mathbb Z}_{n}$. Show that
\[H(X+Y)\geq \max\{H(X),H(Y)\}.\]
Why is this remark of interest in the context of one-time pads?
Does this result remain true if $X$ and $Y$ need not be independent?
Give a proof or counterexample.
\end{question}
\begin{question}\label{C4.17}
I use the Elgamal signature scheme
described on page~\pageref{P;Elgamal}. Instead of choosing $k$
at random, I increase the value used by $2$ each time I use it.
Show that it will often be possible to find
my private key $u$ from two
successive messages.
\end{question}
\begin{question}\label{C4.18}
Confident in the unbreakability of
RSA, I write the following. What mistakes have I made?
\begin{center}
0000001 0000000 0002048 0000001 1391142\\
0000000 0177147 1033288 1391142 1174371.
\end{center}
Advise me on how to increase the security of messages.
\end{question}
\begin{question}\label{E;maximum period}\label{C4.19}
Let $K$ be the finite field with $2^{d}$ elements
and primitive root $\alpha$. (Recall that $\alpha$ is
a generator of the cyclic group $K\setminus\{0\}$
under multiplication.) Let $T:K\rightarrow{\mathbb F}_{2}$
be a non-zero linear map. (Here we treat $K$ as a vector
space over ${\mathbb F}_{2}$.)
(i) Show that the map $S:K\times K \rightarrow{\mathbb F}_{2}$
given by $S(x,y)=T(xy)$ is a symmetric bilinear form.
Show further that $S$ is non-degenerate
(that is to say $S(x,y)=0$ for all $x$ implies $y=0$).
(ii) Show that the sequence $x_{n}=T(\alpha^{n})$
is the output from a linear feedback register of length
at most $d$. (Part~(iii) shows that it must be exactly $d$.)
(iii) Show that the period of the system (that is to say the minimum
period of the sequence $x_{n}$) is $2^{d}-1$. Explain briefly why this
is best possible.
\end{question}
\end{document}