afl-material: comparison handouts/ho01.tex

equal deleted inserted replaced

-:1bdec6a9e03d
+:52ee218151f9
 \usepackage{hyperref}
 \usepackage{amssymb}
 \usepackage{amsmath}
 \usepackage[T1]{fontenc}
+\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
 \begin{document}
 \section*{Handout 1}
 This course is about processing of strings. Lets start with what we mean by \emph{string}. Strings
 \noindent
 The important point is that we can always decompose strings. For example we often consider the
 first character of a string, say $h$, and the ``rest''  of a string {\it "ello"}.
 There are also some subtleties with the empty string, sometimes written as {\it ""} or as the empty list
-of characters $[\,]$.
+of characters $[\,]$. Two strings, say $s_1$ and $s_2$, can be \emph{concatenated} which is written
+as $s_1 @ s_2$. For example if we have the two strings {\it "foo"} and {\it "bar"}, their concatenation
+gives {\it "foobar"}.
 We often need to talk about sets of strings. For example the set of all strings
 \[
 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}
 is not, since the top-level-domains must be of length of at least two. Note that there is
 the convention that uppercase letters are treated in email-addresses as if they were
 lower-case.
 \bigskip
+Before we expand on the topic of regular expressions, let us review some operations on
+sets. We will use capital letters, such as $A$, $B$ and so on, to stand for sets of strings.
+The union of two sets is written as usual as $A \cup B$. We also need to define the
+\emph{concatenation} of two sets (of strings). This can be defined as
+\[
+A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}
+\]
+\noindent
+which essentially means take the first string from the set $A$ and concatenate it with every
+string in the set $B$, then take the second string from $A$ and so on. We also need to define
+the power of a set, written as $A^n$. This is defined inductively as follows
+\begin{center}
+\begin{tabular}{rcl}
+$A^0$ & $\dn$ & $\{[\,]\}$ \\
+$A^{n+1}$ & $\dn$ & $A @ A^n$\\
+\end{tabular}
+\end{center}
+\noindent
+Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union
+of every power of $A^n$ for every $n\ge 0$. The mathematical notation for this operation is
+\[
+A^* \dn \bigcup_{0\le n} A^n
+\]
+\noindent
+This means the star of a set $A$ contains always the empty string, one copy of every string in $A$, two
+copies and so on. In case $A=\{"a\}$ we have
+\[
+A^* = \{"", "a", "aa", "aaa", \ldots\}
+\]
+\noindent
+Be aware that these operations sometimes have non-intuitive properties, for example
+\begin{center}
+\begin{tabular}{ccc}
+\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
+$A \cup \varnothing$ & $=$ & $A$\\
+$A \cup A$ & $=$ & $A$\\
+$A \cup B$ & $=$ & $B \cup A$\\
+\end{tabular} &
+\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
+$A @ B$ & $\not =$ & $B @ A$\\
+$A  @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\
+$A  @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
+\end{tabular} &
+\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
+$\varnothing^*$ & $=$ & $\{""\}$\\
+$\{""\}^*$ & $=$ & $\{""\}$\\
+$A^1$ & $=$ & $A$\\
+\end{tabular}
+\end{tabular}
+\end{center}
+\bigskip
 \noindent
 \emph{Regular expressions} are meant for conveniently describing languages...at least languages
 we are interested in in Computer Science.  For example there is no convenient regular expression
 for describing the English language short of enumerating all English words/strings like in a dictionary.
 But they seem useful for describing all permitted email addresses, as seen above.
 & $\mid$ & $r^*$                      & star (zero or more)\\
 \end{tabular}
 \end{center}
 \noindent
-There are some subtleties you should be aware of. First, we will use parentheses to disambiguate
+There are some subtleties you should be aware of. The letter $c$ stands for any character from the
+alphabet. Second, we will use parentheses to disambiguate
 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$.
 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
 $r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$,
 but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots
 \]
 \noindent
 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then
-a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain.
+a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain. Similarly if
+we want to specify the regular expression for the string {\it "hello"} we should write
+\[
+{\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}
+\]
+\noindent
+but often just write {\it hello}.
+Another source of confusion might be that we use the term \emph{regular expressions} for the ones used in ``theory''
+and the ones in ``practice''. In this course by default we refer to the regular expressions defined by the grammar above.
+In ``practice'' we often use $r^+$ for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ for an optional regular
+expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are mere convenience
+as they can be seen as shorthand for
+\begin{center}
+\begin{tabular}{rcl}
+$r^+$ & $\mapsto$ & $r\cdot r^*$\\
+$r^?$ & $\mapsto$ & $\epsilon + r$\\
+$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
+$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
+\end{tabular}
+\end{center}
+\noindent
+We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular
+expression is supposed to stand for every string \emph{not} match by a regular expression. We will write
+such regular expressions as $~r$. While being ``convenience'' it is often not so clear what the shorthand for
+these kind of regular expressions is.
+So far we have only considered informally what the \emph{meaning} of a regular expression is.
+Formally we associate with every regular expression a set of strings which are matched by this
+regular expression. This can be formally defined as
+\begin{center}
+\begin{tabular}{rcl}
+$L(\varnothing)$  & $\dn$ & $\{\,\}$\\
+$L(\epsilon)$       & $\dn$ & $\{""\}$\\
+$L(c)$                  & $\dn$ & $\{"c"\}$\\
+$L(r_1+ r_2)$      & $\dn$ & $L(r_1) \cup L(r_2)$\\
+$L(r_1 \cdot r_2)$  & $\dn$ & $\{s_1@ s_2 | s_1 \in L(r_1) \wedge s_2 \in L(r_2) \}$\\
+$L(r^*)$                   & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\
+\end{tabular}
+\end{center}
+\noindent
+This means we can now precisely state what the meaning, for example, of the regular expression
+${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely
+$L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$. Similarly if we have the choice
+$a + b$, the meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
+be matched by this choice.
+The point of this definition is that we can now precisely specify when a string is matched by a
+regular expression, namely a string, say $s$, is matched by a regular expression, say $r$, if
+and only if $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
+any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
+if $s \not\in L(r)$. We leave this for the next lecture.
 \end{document}
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: t

changeset 108	52ee218151f9
parent 107	1bdec6a9e03d
child 110	9353308f1c6a