handouts/ho01.tex
changeset 108 52ee218151f9
parent 107 1bdec6a9e03d
child 110 9353308f1c6a
--- a/handouts/ho01.tex	Thu Sep 26 13:15:05 2013 +0100
+++ b/handouts/ho01.tex	Thu Sep 26 15:11:25 2013 +0100
@@ -5,6 +5,8 @@
 \usepackage{amsmath}
 \usepackage[T1]{fontenc}
 
+\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
+
 \begin{document}
 
 \section*{Handout 1}
@@ -25,7 +27,9 @@
 The important point is that we can always decompose strings. For example we often consider the
 first character of a string, say $h$, and the ``rest''  of a string {\it "ello"}. 
 There are also some subtleties with the empty string, sometimes written as {\it ""} or as the empty list
-of characters $[\,]$. 
+of characters $[\,]$. Two strings, say $s_1$ and $s_2$, can be \emph{concatenated} which is written
+as $s_1 @ s_2$. For example if we have the two strings {\it "foo"} and {\it "bar"}, their concatenation
+gives {\it "foobar"}.
 
 We often need to talk about sets of strings. For example the set of all strings
 
@@ -78,6 +82,67 @@
 lower-case.
 \bigskip
 
+Before we expand on the topic of regular expressions, let us review some operations on
+sets. We will use capital letters, such as $A$, $B$ and so on, to stand for sets of strings. 
+The union of two sets is written as usual as $A \cup B$. We also need to define the
+\emph{concatenation} of two sets (of strings). This can be defined as
+
+\[
+A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}
+\]
+
+\noindent
+which essentially means take the first string from the set $A$ and concatenate it with every
+string in the set $B$, then take the second string from $A$ and so on. We also need to define
+the power of a set, written as $A^n$. This is defined inductively as follows
+
+\begin{center}
+\begin{tabular}{rcl}
+$A^0$ & $\dn$ & $\{[\,]\}$ \\
+$A^{n+1}$ & $\dn$ & $A @ A^n$\\
+\end{tabular}
+\end{center}
+
+\noindent
+Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union
+of every power of $A^n$ for every $n\ge 0$. The mathematical notation for this operation is
+
+\[
+A^* \dn \bigcup_{0\le n} A^n
+\]
+
+\noindent
+This means the star of a set $A$ contains always the empty string, one copy of every string in $A$, two
+copies and so on. In case $A=\{"a\}$ we have 
+
+\[
+A^* = \{"", "a", "aa", "aaa", \ldots\}
+\]
+
+\noindent
+Be aware that these operations sometimes have non-intuitive properties, for example
+
+\begin{center}
+\begin{tabular}{ccc}
+\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
+$A \cup \varnothing$ & $=$ & $A$\\
+$A \cup A$ & $=$ & $A$\\
+$A \cup B$ & $=$ & $B \cup A$\\
+\end{tabular} &
+\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
+$A @ B$ & $\not =$ & $B @ A$\\
+$A  @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\
+$A  @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
+\end{tabular} &
+\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
+$\varnothing^*$ & $=$ & $\{""\}$\\
+$\{""\}^*$ & $=$ & $\{""\}$\\
+$A^1$ & $=$ & $A$\\
+\end{tabular} 
+\end{tabular}
+\end{center}
+\bigskip
+
 \noindent
 \emph{Regular expressions} are meant for conveniently describing languages...at least languages
 we are interested in in Computer Science.  For example there is no convenient regular expression
@@ -98,7 +163,8 @@
 \end{center}
 
 \noindent
-There are some subtleties you should be aware of. First, we will use parentheses to disambiguate
+There are some subtleties you should be aware of. The letter $c$ stands for any character from the
+alphabet. Second, we will use parentheses to disambiguate
 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$.
 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
 $r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$,
@@ -113,9 +179,64 @@
 
 \noindent
 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then
-a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain.
+a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain. Similarly if
+we want to specify the regular expression for the string {\it "hello"} we should write
+
+\[
+{\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}
+\]
+
+\noindent
+but often just write {\it hello}.
+
+Another source of confusion might be that we use the term \emph{regular expressions} for the ones used in ``theory''
+and the ones in ``practice''. In this course by default we refer to the regular expressions defined by the grammar above. 
+In ``practice'' we often use $r^+$ for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ for an optional regular
+expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are mere convenience 
+as they can be seen as shorthand for
+
+\begin{center}
+\begin{tabular}{rcl}
+$r^+$ & $\mapsto$ & $r\cdot r^*$\\
+$r^?$ & $\mapsto$ & $\epsilon + r$\\
+$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
+$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
+\end{tabular}
+\end{center}
 
+\noindent
+We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular
+expression is supposed to stand for every string \emph{not} match by a regular expression. We will write
+such regular expressions as $~r$. While being ``convenience'' it is often not so clear what the shorthand for
+these kind of regular expressions is.
 
+So far we have only considered informally what the \emph{meaning} of a regular expression is.  
+Formally we associate with every regular expression a set of strings which are matched by this
+regular expression. This can be formally defined as 
+
+\begin{center}
+\begin{tabular}{rcl}
+$L(\varnothing)$  & $\dn$ & $\{\,\}$\\
+$L(\epsilon)$       & $\dn$ & $\{""\}$\\
+$L(c)$                  & $\dn$ & $\{"c"\}$\\
+$L(r_1+ r_2)$      & $\dn$ & $L(r_1) \cup L(r_2)$\\
+$L(r_1 \cdot r_2)$  & $\dn$ & $\{s_1@ s_2 | s_1 \in L(r_1) \wedge s_2 \in L(r_2) \}$\\
+$L(r^*)$                   & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\
+\end{tabular}
+\end{center}
+
+\noindent
+This means we can now precisely state what the meaning, for example, of the regular expression 
+${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely 
+$L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$. Similarly if we have the choice
+$a + b$, the meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
+be matched by this choice. 
+
+The point of this definition is that we can now precisely specify when a string is matched by a
+regular expression, namely a string, say $s$, is matched by a regular expression, say $r$, if
+and only if $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
+any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
+if $s \not\in L(r)$. We leave this for the next lecture.
 \end{document}
 
 %%% Local Variables: