Kleene's theorem

2.2 Kleene's theorem

Kleene's theorem states that a language is regular if it can be notated by a regular expression. One direction of this is fairly easy: showing that if there is a regular expression for a language L then there is a machine that recognises L. This breaks down into several steps, one for each constructor: slash, concatenation and Kleene star. We've seen how to do slash - after all L(K₁|K₂ ) is just L(K₁)∪L(K₂) - but to do the other two involves nondeterministic machines and we don't encounter those until later.

The hard part is the other direction: showing how to find a regular expression for the language recognised by a given machine.

What we prove is something apparently much stronger:
For every machine ℳ, for any two states q₁ and q₂ of ℳ, and for any set Q of states, there is a regular expression φ(q₁, q₂, Q) which notates the set of strings that take us from state q₁ to state q₂ while never leaving the set Q.

Of course all we are after is the regular expression formed by putting slashes between all the expression φ(q₁, q₂, Q) where q₁ is the initial state, q₂ is an accepting state, and Q is the set of all states. But it turns out that the only way to prove this special case is to prove the much more general assertion.

We prove this general assertion by induction. The only way to have a hope of understanding this proof is to be quite clear about what it is we are proving by induction. You are probably accustomed to having 'n' as the induction variable in your proofs by induction so let's do that here.

"For all machines ℳ, and for all subsets Q of the set of ℳ's states with |Q| = n, and for any two states q₁ and q₂ of ℳ, there is a regular expression φ(q₁, q₂, Q) which notates the set of strings that take us from state q₁ to state q₂ while never leaving the set Q"

We fix ℳ once for all (so that we are doing a '∀-introduction rule, or "universal generalisation" on the variable 'ℳ', and we prove by induction on 'n' that this is true for all n.

At the risk of tempting fate, I am inclined to say at this point that if you are happy with what has gone so far in this section (and that is quite a big if!) then you have done all the hard work. The proof by induction is not very hard. The hard part lay in seeing that you had to prove the more general assertion first and then derive Kleene's theorem as a consequence.

Proofs by induction all have two parts. (i) A base case, and (ii) induction step. I submit, ladies and gentlemen, that the base case - with n = 1 - is obvious. Whatever ℳ, s and q are, you can either get from q₁ to q₂ in one hop by means of - character c, say (in which case the regular expression is c) - or you can't, in which case the regular expression is ε.

Now let's think about the induction step. Suppose our assertion true for n. We want to prove it true for n + 1.

We are given a machine ℳ, and two states q₁ and q₂ of ℳ. We want to show that for any set Q of states of ℳ, with |Q| = n + 1, there is a regular expression that captures the strings that take the machine from q₁ to q₂ without leaving Q.

What are we allowed to assume? The induction hypothesis tells us that for any two states s' and t' and any set Q' of states with |Q' | = n, there is a regular expression that captures the strings that take the machine from q'₁ to q'₂ without leaving Q' . (I have written 'q'₁' and 'q'₂' and Q' because I don't want to reuse the same variables!)

For any state r in Q, we can reason as follows: "Every string that takes ℳ from q₁ to q₂ without leaving Q either goes through r or it doesn't.

The strings that take ℳ from q₁ to q₂ without either going through r or leaving Q are captured by a regular expression because |Q \ {r}| = n. Let w₁ be this regular expression.

The strings that take ℳ from q₁ to q₂ via r are slightly more complicated. By induction hypothesis we have a regular expression for the set of strings that take ℳ from q₁ to r without going through q₂ (while remaining in Q) - because |Q \ {q₂ }| = n - so let's call that regular expression w₂. Similarly by induction hypothesis we have a regular expression for the set of strings that take ℳ from r to q₂ without going through q₁ (while remaining in Q) - because |Q\{q₁ }| = n - so let's call that regular expression w₃. Finally by induction hypothesis we have a regular expression for the set of strings that take ℳ from r back to r without going through q₁ or q₂ (while remaining in Q) - because |Q\{q₁, q₂ }| = n-1 - so let's call that regular expression w₄.

Now a string that takes ℳ from q₁ to q₂ via r will consist of a segment that takes ℳ from q₁ to r (captured by w₂) followed by a bit that takes it from r back to r any number of times (captured by w₄*) followed by a bit that takes ℳ from r to q₂ (captured by w₃).

So the regular expression we want is w₁ |w₂ (w₄ )*w₃.

This concludes the proof.

If you are a confident and fluent programmer in a language that handles strings naturally then you should try to program the algorithm on which this proof relies. It will give you good exercise in programming and will help you understand the algorithm.

Next: 2.2.1 Some more exercises
Back: 2.1.1 More about bombs