1.1 Languages Recognised by Machines

Languages, parsing, compilers etc are the chief motivation for this course. You might think that a language is something like English, or Spanish, or perhaps (since this is a Computer Science course) PASCAL or JAVA or something like that: a naturally occurring set of strings-of-letters with some natural definition and a sensible purpose such as meaning something! Who could blame you? It's a very reasonable thing to expect. Unfortunately for us the word 'language' has been hijacked by the formal-language people to mean something much more general than that. What they mean is the following.

We start with an alphabet Σ (and for some reason they always are called Σ, don't ask me why) which is a finite set of characters. We are interested in strings of characters from the alphabet in question. A string of characters is not the same as a set of characters. Sets do not have order - the set {a, b} is the same set as the set {b, a} but strings do: the string ab is not the same as the string ba. Sets do not have multiplicity: the set {a, a, b} is the same set as the set {a, b} (which is why nobody writes things like '{a, a, b}') but strings definitely do have multiplicity: the string aab is not the same string as the string ab. Also, slightly confusingly, although we have this '{' and '}' notation for sets, there is no corresponding delimiter for strings. Notation was not designed rationally, but just evolved haphazardly.

We write 'Σ*' for the set of all strings formed from characters in Σ, and then a language ("over Σ") is just a subset of Σ* : any subset at all. A subset of Σ* doesn't have to have a sensible definition in order to be called a language by the formal-language people. While we are about it, notice also that people use the word 'word' almost interchangeably with 'string'.

A word of warning. Many people confuse ⊆ and ∊, and there is a parallel confusion that occurs here. (This is not the same as the confusion between sets and strings: this is the common confusion between sets and their members!) A language is not a string: it is a set of strings. This may be because they are thinking that strings are sets of characters and that accordingly a language - on this view - is a set of sets. If in addition you think that everything there is to be said about sets can be drawn in Venn diagrams, this will confuse you. Venn diagrams give you pictures of sets of points, but not sets of sets. A Venn diagram picture involving three levels of sets is impossible.

We will write '|w|' for the length of the string w. This may remind you of the notation '|A|' for the number of elements of A when A is a set.

(Aside on notation: the use of the asterisk in this way to mean something like "finite repetitions of" is widespread. You might connect this with the notation 'R*' for the transitive closure of R. We will also see the notation 'δ* ' for the function that takes a state s, and a string w and returns the state that the machine reaches if we start it in s and then perform successively all the transitions mandated by the characters in w. Thus, for example, for characters a, b and c, δ* (s, ab) = δ(δ(s, a), b) and δ* (s, abc) = δ(δ(δ(s, a), b), c) and so on.

In fact δ* : S ⨯ Σ → S. It's easier than it looks!

Although a language is any old subset of Σ* , on the whole we are interested only in languages that are infinite. And here I find that students need to be tipped off about the need to be careful. Print out this warning and pin it to the wall:


The languages we are interested in are usually infinite, but the strings in them are always finite!


Let's have some examples. Let Σ be the alphabet {a, b}.

The language {aa, ab, ba, bb} is the language of two-letter words over Σ. {an : n ∊ ℕ} is the set of all strings consisting entirely of a's which is the (yes, it's infinite) language {ε, a, aa, aaa, . . .}. {an bn : n ∊ ℕ} is the set of all strings that consist of a's followed by the same number of b's.

Question: What might the symbol 'ε' mean above (in "{ε, a, aa, aaa, . . .}").?

Click here for the answer

Mathematics is full of informal conventions that people respect because they find them helpful. Some of them apply here, and we will respect them.

  1. We tend to use letters from near the beginning of the alphabet, a, b, . . . for characters in alphabets;
  2. We tend to use lower-case letters from near the end of the alphabet - like 'u', 'v' and 'w'- for variables to range over strings;
  3. q is often a variable used to vary over states;
  4. We tend to use upper-case letters from the middle of the alphabet - like 'K', 'L' - for variables to range over languages.

If w and u are strings then wu is w with u stuck on the end, and uw is u with w stuck on the end. Naturally |uw| = |wu| = |w| + |u|. ε is just the empty string so εw - the empty string with w concatenated on the end - is just w. So ww, www and wwwww are respectively the result of stringing two, three and five copies of w together. Instead of writing 'wwwww' we write 'w5' and we even use this notation for variable exponents, so that wn is n copies of w strung together.

Next: 1.1.1 Languages from machines
Back: 1 Machines