# HG changeset patch # User Christian Urban # Date 1380204685 -3600 # Node ID 52ee218151f95960789ad137a92592822cb25646 # Parent 1bdec6a9e03d8a7110e61fcb755ceede517dcd33 added diff -r 1bdec6a9e03d -r 52ee218151f9 handouts/ho01.pdf Binary file handouts/ho01.pdf has changed diff -r 1bdec6a9e03d -r 52ee218151f9 handouts/ho01.tex --- a/handouts/ho01.tex Thu Sep 26 13:15:05 2013 +0100 +++ b/handouts/ho01.tex Thu Sep 26 15:11:25 2013 +0100 @@ -5,6 +5,8 @@ \usepackage{amsmath} \usepackage[T1]{fontenc} +\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}% + \begin{document} \section*{Handout 1} @@ -25,7 +27,9 @@ The important point is that we can always decompose strings. For example we often consider the first character of a string, say $h$, and the ``rest'' of a string {\it "ello"}. There are also some subtleties with the empty string, sometimes written as {\it ""} or as the empty list -of characters $[\,]$. +of characters $[\,]$. Two strings, say $s_1$ and $s_2$, can be \emph{concatenated} which is written +as $s_1 @ s_2$. For example if we have the two strings {\it "foo"} and {\it "bar"}, their concatenation +gives {\it "foobar"}. We often need to talk about sets of strings. For example the set of all strings @@ -78,6 +82,67 @@ lower-case. \bigskip +Before we expand on the topic of regular expressions, let us review some operations on +sets. We will use capital letters, such as $A$, $B$ and so on, to stand for sets of strings. +The union of two sets is written as usual as $A \cup B$. We also need to define the +\emph{concatenation} of two sets (of strings). This can be defined as + +\[ +A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \} +\] + +\noindent +which essentially means take the first string from the set $A$ and concatenate it with every +string in the set $B$, then take the second string from $A$ and so on. We also need to define +the power of a set, written as $A^n$. This is defined inductively as follows + +\begin{center} +\begin{tabular}{rcl} +$A^0$ & $\dn$ & $\{[\,]\}$ \\ +$A^{n+1}$ & $\dn$ & $A @ A^n$\\ +\end{tabular} +\end{center} + +\noindent +Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union +of every power of $A^n$ for every $n\ge 0$. The mathematical notation for this operation is + +\[ +A^* \dn \bigcup_{0\le n} A^n +\] + +\noindent +This means the star of a set $A$ contains always the empty string, one copy of every string in $A$, two +copies and so on. In case $A=\{"a\}$ we have + +\[ +A^* = \{"", "a", "aa", "aaa", \ldots\} +\] + +\noindent +Be aware that these operations sometimes have non-intuitive properties, for example + +\begin{center} +\begin{tabular}{ccc} +\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} +$A \cup \varnothing$ & $=$ & $A$\\ +$A \cup A$ & $=$ & $A$\\ +$A \cup B$ & $=$ & $B \cup A$\\ +\end{tabular} & +\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} +$A @ B$ & $\not =$ & $B @ A$\\ +$A @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\ +$A @ \{""\}$ & $=$ & $\{""\} @ A = A$\\ +\end{tabular} & +\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} +$\varnothing^*$ & $=$ & $\{""\}$\\ +$\{""\}^*$ & $=$ & $\{""\}$\\ +$A^1$ & $=$ & $A$\\ +\end{tabular} +\end{tabular} +\end{center} +\bigskip + \noindent \emph{Regular expressions} are meant for conveniently describing languages...at least languages we are interested in in Computer Science. For example there is no convenient regular expression @@ -98,7 +163,8 @@ \end{center} \noindent -There are some subtleties you should be aware of. First, we will use parentheses to disambiguate +There are some subtleties you should be aware of. The letter $c$ stands for any character from the +alphabet. Second, we will use parentheses to disambiguate regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$. The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times $r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$, @@ -113,9 +179,64 @@ \noindent meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then -a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain. +a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain. Similarly if +we want to specify the regular expression for the string {\it "hello"} we should write + +\[ +{\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o} +\] + +\noindent +but often just write {\it hello}. + +Another source of confusion might be that we use the term \emph{regular expressions} for the ones used in ``theory'' +and the ones in ``practice''. In this course by default we refer to the regular expressions defined by the grammar above. +In ``practice'' we often use $r^+$ for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ for an optional regular +expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are mere convenience +as they can be seen as shorthand for + +\begin{center} +\begin{tabular}{rcl} +$r^+$ & $\mapsto$ & $r\cdot r^*$\\ +$r^?$ & $\mapsto$ & $\epsilon + r$\\ +$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\ +$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\ +\end{tabular} +\end{center} +\noindent +We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular +expression is supposed to stand for every string \emph{not} match by a regular expression. We will write +such regular expressions as $~r$. While being ``convenience'' it is often not so clear what the shorthand for +these kind of regular expressions is. +So far we have only considered informally what the \emph{meaning} of a regular expression is. +Formally we associate with every regular expression a set of strings which are matched by this +regular expression. This can be formally defined as + +\begin{center} +\begin{tabular}{rcl} +$L(\varnothing)$ & $\dn$ & $\{\,\}$\\ +$L(\epsilon)$ & $\dn$ & $\{""\}$\\ +$L(c)$ & $\dn$ & $\{"c"\}$\\ +$L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\ +$L(r_1 \cdot r_2)$ & $\dn$ & $\{s_1@ s_2 | s_1 \in L(r_1) \wedge s_2 \in L(r_2) \}$\\ +$L(r^*)$ & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\ +\end{tabular} +\end{center} + +\noindent +This means we can now precisely state what the meaning, for example, of the regular expression +${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely +$L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$. Similarly if we have the choice +$a + b$, the meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly +be matched by this choice. + +The point of this definition is that we can now precisely specify when a string is matched by a +regular expression, namely a string, say $s$, is matched by a regular expression, say $r$, if +and only if $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and +any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no}, +if $s \not\in L(r)$. We leave this for the next lecture. \end{document} %%% Local Variables: