Kleene's theorem states that a language is regular if it can be notated by a regular expression. One direction of this is fairly easy: showing that if there is a regular expression for a language L then there is a machine that recognises L. This breaks down into several steps, one for each constructor: slash, concatenation and Kleene star. We've seen how to do slash - after all L(K1|K2 ) is just L(K1)∪L(K2) - but to do the other two involves nondeterministic machines and we don't encounter those until later.
The hard part is the other direction: showing how to find a regular expression for the language recognised by a given machine.
What we prove is something apparently much stronger:
For every machine ℳ, for any two states q1 and q2 of ℳ, and for any set
Q of states, there is a regular expression φ(q1, q2, Q) which notates the set of
strings that take us from state q1 to state q2 while never leaving the set Q.
Of course all we are after is the regular expression formed by putting slashes between all the expression φ(q1, q2, Q) where q1 is the initial state, q2 is an accepting state, and Q is the set of all states. But it turns out that the only way to prove this special case is to prove the much more general assertion.
We prove this general assertion by induction. The only way to have a hope of understanding this proof is to be quite clear about what it is we are proving by induction. You are probably accustomed to having 'n' as the induction variable in your proofs by induction so let's do that here.
We fix ℳ once for all (so that we are doing a '∀-introduction rule, or "universal generalisation" on the variable 'ℳ', and we prove by induction on 'n' that this is true for all n.
At the risk of tempting fate, I am inclined to say at this point that if you are happy with what has gone so far in this section (and that is quite a big if!) then you have done all the hard work. The proof by induction is not very hard. The hard part lay in seeing that you had to prove the more general assertion first and then derive Kleene's theorem as a consequence.
Proofs by induction all have two parts. (i) A base case, and (ii) induction step. I submit, ladies and gentlemen, that the base case - with n = 1 - is obvious. Whatever ℳ, s and q are, you can either get from q1 to q2 in one hop by means of - character c, say (in which case the regular expression is c) - or you can't, in which case the regular expression is ε.
Now let's think about the induction step. Suppose our assertion true for n. We want to prove it true for n + 1.
We are given a machine ℳ, and two states q1 and q2 of ℳ. We want to show that for any set Q of states of ℳ, with |Q| = n + 1, there is a regular expression that captures the strings that take the machine from q1 to q2 without leaving Q.
What are we allowed to assume? The induction hypothesis tells us that for any two states s' and t' and any set Q' of states with |Q' | = n, there is a regular expression that captures the strings that take the machine from q'1 to q'2 without leaving Q' . (I have written 'q'1' and 'q'2' and Q' because I don't want to reuse the same variables!)
For any state r in Q, we can reason as follows: "Every string that takes ℳ from q1 to q2 without leaving Q either goes through r or it doesn't.
The strings that take ℳ from q1 to q2 without either going through r or leaving Q are captured by a regular expression because |Q \ {r}| = n. Let w1 be this regular expression.
The strings that take ℳ from q1 to q2 via r are slightly more complicated. By induction hypothesis we have a regular expression for the set of strings that take ℳ from q1 to r without going through q2 (while remaining in Q) - because |Q \ {q2 }| = n - so let's call that regular expression w2. Similarly by induction hypothesis we have a regular expression for the set of strings that take ℳ from r to q2 without going through q1 (while remaining in Q) - because |Q\{q1 }| = n - so let's call that regular expression w3. Finally by induction hypothesis we have a regular expression for the set of strings that take ℳ from r back to r without going through q1 or q2 (while remaining in Q) - because |Q\{q1, q2 }| = n-1 - so let's call that regular expression w4.
Now a string that takes ℳ from q1 to q2 via r will consist of a segment that takes ℳ from q1 to r (captured by w2) followed by a bit that takes it from r back to r any number of times (captured by w4*) followed by a bit that takes ℳ from r to q2 (captured by w3).
So the regular expression we want is w1 |w2 (w4 )*w3.
This concludes the proof.
If you are a confident and fluent programmer in a language that handles strings naturally then you should try to program the algorithm on which this proof relies. It will give you good exercise in programming and will help you understand the algorithm.
Next: 2.2.1 Some more exercises