updated
authorChristian Urban <christian dot urban at kcl dot ac dot uk>
Fri, 27 Sep 2013 10:53:29 +0100
changeset 110 9353308f1c6a
parent 109 f2a90dda7e3b
child 111 1933e88cb73e
updated
handouts/ho01.pdf
handouts/ho01.tex
Binary file handouts/ho01.pdf has changed
--- a/handouts/ho01.tex	Thu Sep 26 15:17:38 2013 +0100
+++ b/handouts/ho01.tex	Fri Sep 27 10:53:29 2013 +0100
@@ -11,35 +11,38 @@
 
 \section*{Handout 1}
 
-This course is about processing of strings. Lets start with what we mean by \emph{string}. Strings
-are lists of characters drawn from an \emph{alphabet}. If nothing else is specified, we usually assume 
-the alphabet are letters $a$, $b$, \ldots, $z$ and $A$, $B$, \ldots $Z$. Sometimes we explicitly
-restrict strings to only contain the letters $a$ and $b$. Then we say the alphabet is the set $\{a, b\}$.
+This course is about the processing of strings. Lets start with what we mean by \emph{strings}. Strings
+(they are also sometimes referred to as \emph{words}) are lists of characters drawn from an \emph{alphabet}. 
+If nothing else is specified, we usually assume 
+the alphabet consists of just the lower-case letters $a$, $b$, \ldots, $z$. Sometimes, however, we explicitly
+restrict strings to contain, for example, only the letters $a$ and $b$. In this case we say the alphabet is the set $\{a, b\}$.
 
-There are many ways how we write string. Since they are lists of characters we might write
-them as {\it "hello"} being enclosed by double quotes. This is a short-hand for the list
+There are many ways how we can write down strings. In programming languages, they are usually 
+written as {\it "hello"} where the double quotes indicate that we dealing with a string. 
+Essentially, strings are lists of characters which can be written for example as follows
 
 \[
 [\text{\it h, e, l, l, o}]
 \]
 
 \noindent
-The important point is that we can always decompose strings. For example we often consider the
-first character of a string, say $h$, and the ``rest''  of a string {\it "ello"}. 
-There are also some subtleties with the empty string, sometimes written as {\it ""} or as the empty list
-of characters $[\,]$. Two strings, say $s_1$ and $s_2$, can be \emph{concatenated} which is written
-as $s_1 @ s_2$. For example if we have the two strings {\it "foo"} and {\it "bar"}, their concatenation
+The important point is that we can always decompose strings. For example, we will often consider the
+first character of a string, say $h$, and the ``rest''  of a string say {\it "ello"} when making definitions 
+about strings. There are some subtleties with the empty string, sometimes written as {\it ""} but also as 
+the empty list of characters $[\,]$. Two strings, for example $s_1$ and $s_2$, can be \emph{concatenated}, 
+which we write as $s_1 @ s_2$. Suppose we are given two strings {\it "foo"} and {\it "bar"}, then their concatenation
 gives {\it "foobar"}.
 
-We often need to talk about sets of strings. For example the set of all strings
+We often need to talk about sets of strings. For example the set of all strings over the alphabet $\{a, \ldots\, z\}$
+is
 
 \[
 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}
 \]
 
 \noindent
-Any set of strings, not just the set of all strings, is often called a \emph{language}. The idea behind
-this choice is that if we enumerate, say, all words/strings from a dictionary, like 
+Any set of strings, not just the set-of-all-strings, is often called a \emph{language}. The idea behind
+this choice of terminology is that if we enumerate, say, all words/strings from a dictionary, like 
 
 \[
 \{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\}
@@ -48,18 +51,18 @@
 \noindent
 then we have essentially described the English language, or more precisely all
 strings that can be used in a sentence of the English language. French would be a
-different set of string, and so on. In the context of this course, a language might 
-not necessarily make sense from a natural language perspective. For example
-the set of all strings from above is a language, as is the empty set (of strings). The
+different set of strings, and so on. In the context of this course, a language might 
+not necessarily make sense from a natural language point of view. For example
+the set of all strings shown above is a language, as is the empty set (of strings). The
 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a 
-difference between the empty set, or empty language, and the set, or language, that 
-contains the empty string $\{\text{""}\}$: the former has no elements, the latter has one 
-element.
+difference between the empty set, or empty language, and the set that 
+contains only the empty string $\{\text{""}\}$: the former has no elements, whereas 
+the latter has one element.
 
-As seen there are languages which contain infinitely many strings, like the set of all strings.
-The ``natural'' languages English, French and so on contain many but only finitely many 
-strings (the ones listed in a good dictionary). It might be therefore surprising that the
-language consisting of all email addresses is infinite if we assume it is defined by the
+As seen, there are languages which contain infinitely many strings, like the set of all strings.
+The ``natural'' languages like English, French and so on contain many but only finitely many 
+strings (namely the ones listed in a good dictionary). It might be therefore be surprising that the
+language consisting of all email addresses is infinite provided we assume it is defined by the
 regular expression\footnote{See \url{http://goo.gl/5LoVX7}}
 
 \[
@@ -67,9 +70,9 @@
 \]
 
 \noindent
-The reason is that for example before the $@$-sign there can be any string you want if it 
-is made up from letters, digits, underscore, dot and hyphen---there are infinitely many
-of those. Similarly the string after the $@$-sign can be any string. This does not mean 
+The reason is that for example before the $@$-sign there can be any string you want assuming it 
+is made up from letters, digits, underscores, dots and hyphens---clearly there are infinitely many
+of those. Similarly the string after the $@$-sign can be any string. However, this does not mean 
 that every string is an email address. For example
 
 \[
@@ -77,15 +80,15 @@
 \]
 
 \noindent
-is not, since the top-level-domains must be of length of at least two. Note that there is
+is not, because the top-level-domains must be of length of at least two. (Note that there is
 the convention that uppercase letters are treated in email-addresses as if they were
-lower-case.
+lower-case.)
 \bigskip
 
 Before we expand on the topic of regular expressions, let us review some operations on
-sets. We will use capital letters, such as $A$, $B$ and so on, to stand for sets of strings. 
+sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings. 
 The union of two sets is written as usual as $A \cup B$. We also need to define the
-\emph{concatenation} of two sets (of strings). This can be defined as
+operation of \emph{concatenating} two sets of strings. This can be defined as
 
 \[
 A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}
@@ -93,8 +96,11 @@
 
 \noindent
 which essentially means take the first string from the set $A$ and concatenate it with every
-string in the set $B$, then take the second string from $A$ and so on. We also need to define
-the power of a set, written as $A^n$. This is defined inductively as follows
+string in the set $B$, then take the second string from $A$ do the same and so on. You might
+like to think about what this definition means in case $A$ or $B$ is the empty set.
+
+We also need to define
+the power of a set, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows
 
 \begin{center}
 \begin{tabular}{rcl}
@@ -105,22 +111,23 @@
 
 \noindent
 Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union
-of every power of $A^n$ for every $n\ge 0$. The mathematical notation for this operation is
+of every power of $A^n$ with $n\ge 0$. The mathematical notation for this operation is
 
 \[
 A^* \dn \bigcup_{0\le n} A^n
 \]
 
 \noindent
-This means the star of a set $A$ contains always the empty string, one copy of every string in $A$, two
-copies and so on. In case $A=\{"a\}$ we have 
+This definition implies that the star of a set $A$ contains always the empty string (that is $A^0$), one 
+copy of every string in $A$ (that is $A^1$), two copies in $A$ (that is $A^2$) and so on. In case $A=\{"a"\}$ we therefore 
+have 
 
 \[
 A^* = \{"", "a", "aa", "aaa", \ldots\}
 \]
 
 \noindent
-Be aware that these operations sometimes have non-intuitive properties, for example
+Be aware that these operations sometimes have quite non-intuitive properties, for example
 
 \begin{center}
 \begin{tabular}{ccc}
@@ -137,16 +144,16 @@
 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
 $\varnothing^*$ & $=$ & $\{""\}$\\
 $\{""\}^*$ & $=$ & $\{""\}$\\
-$A^1$ & $=$ & $A$\\
+$A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\
 \end{tabular} 
 \end{tabular}
 \end{center}
 \bigskip
 
 \noindent
-\emph{Regular expressions} are meant for conveniently describing languages...at least languages
+\emph{Regular expressions} are meant to conveniently describe languages...at least languages
 we are interested in in Computer Science.  For example there is no convenient regular expression
-for describing the English language short of enumerating all English words/strings like in a dictionary. 
+for describing the English language short of enumerating all English words. 
 But they seem useful for describing all permitted email addresses, as seen above. 
 
 Regular expressions are given by the following grammar:
@@ -155,7 +162,7 @@
 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
   $r$ & $::=$ &   $\varnothing$         & null\\
         & $\mid$ & $\epsilon$              & empty string / "" / []\\
-        & $\mid$ & $c$                         & character\\
+        & $\mid$ & $c$                         & single character\\
         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
         & $\mid$ & $r_1 + r_2$            & alternative / choice\\
         & $\mid$ & $r^*$                      & star (zero or more)\\
@@ -163,15 +170,15 @@
 \end{center}
 
 \noindent
-There are some subtleties you should be aware of. The letter $c$ stands for any character from the
-alphabet. Second, we will use parentheses to disambiguate
-regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$.
+Because we overload our notation there are some subtleties you should be aware of. The letter $c$ stands for any character from the
+alphabet at hand. Second, we will use parentheses to disambiguate
+regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$.
 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
-$r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$,
-but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
+$r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$,
+but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
-$r_1 + r_2$  is written as $r_1\mid{}r_2$. In case of $\cdot$ we will even often omit it all together. For example
-the regular expression for email addresses is meant to be of the form
+$r_1 + r_2$  is written as $r_1\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even 
+often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form
 
 \[
 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots
@@ -189,10 +196,10 @@
 \noindent
 but often just write {\it hello}.
 
-Another source of confusion might be that we use the term \emph{regular expressions} for the ones used in ``theory''
-and the ones in ``practice''. In this course by default we refer to the regular expressions defined by the grammar above. 
-In ``practice'' we often use $r^+$ for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ for an optional regular
-expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are mere convenience 
+Another source of confusion might arise from the fact that we use the term \emph{regular expressions} for the ones used in ``theory''
+and also the ones in ``practice''. In this course we refer by default to the regular expressions defined by the grammar above. 
+In ``practice'' we often use $r^+$ to stand for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ to stand for an optional regular
+expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are however mere convenience 
 as they can be seen as shorthand for
 
 \begin{center}
@@ -204,14 +211,16 @@
 \end{tabular}
 \end{center}
 
-\noindent
+
 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular
-expression is supposed to stand for every string \emph{not} match by a regular expression. We will write
-such regular expressions as $~r$. While being ``convenience'' it is often not so clear what the shorthand for
-these kind of regular expressions is.
+expression is supposed to stand for every string \emph{not} matched by a regular expression. We will write
+such not-regular-expressions as $\sim{}r$. While being ``convenience'' it is often not so clear what the shorthand for
+these kind of not-regular-expressions is. Try to write down the regular expression which can match any
+string except {\it "hello"} and {\it "world"}. It is possible in principle, but often it is easier to just include
+$\sim{}r$ in the definition or regular expressions. Whenever we do so, we will state it explicitly.
 
 So far we have only considered informally what the \emph{meaning} of a regular expression is.  
-Formally we associate with every regular expression a set of strings which are matched by this
+Formally, we associate with every regular expression a set of strings which are matched by this
 regular expression. This can be formally defined as 
 
 \begin{center}