Binary file handouts/ho01.pdf has changed
--- a/handouts/ho01.tex Thu Sep 26 11:50:31 2013 +0100
+++ b/handouts/ho01.tex Thu Sep 26 13:06:27 2013 +0100
@@ -48,8 +48,63 @@
not necessarily make sense from a natural language perspective. For example
the set of all strings from above is a language, as is the empty set (of strings). The
empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a
-difference between the empty set $\{\,\}$ and the set that contains the empty string $\{\text{""}\}$.
+difference between the empty set, or empty language, and the set, or language, that
+contains the empty string $\{\text{""}\}$: the former has no elements, the latter has one
+element.
+
+As seen there are languages which contain infinitely many strings, like the set of all strings.
+The ``natural'' languages English, French and so on contain many but only finitely many
+strings (the ones listed in a good dictionary). It might be therefore surprising that the
+language consisting of all email addresses is infinite if we assume it is defined by the
+regular expression\footnote{See \url{http://goo.gl/5LoVX7}}
+
+\[
+([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})
+\]
+
+\noindent
+The reason is that for example before the $@$-sign there can be any string you want if it
+is made up from letters, digits, underscore, dot and hyphen---there are infinitely many
+of those. Similarly the string after the $@$-sign can be any string. This does not mean
+that every string is an email address. For example
+
+\[
+\text{\it foo}@\text{\it bar}.\text{\it c}
+\]
+\noindent
+is not, since the top-level-domains must be of length of at least two. Note that there is
+the convention that uppercase letters are treated in email-addresses as if they were
+lower-case.
+\bigskip
+
+\noindent
+\emph{Regular expressions} are meant for conveniently describing languages...at least languages
+we are interested in in Computer Science. For example there is no convenient regular expression
+for describing the English language short of enumerating all English words/strings like in a dictionary.
+But they seem useful for describing all permitted email addresses, as seen above.
+
+Regular expressions are given by the following grammar:
+
+\begin{center}
+\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
+ $r$ & $::=$ & $\varnothing$ & null\\
+ & $\mid$ & $\epsilon$ & empty string / "" / []\\
+ & $\mid$ & $c$ & character\\
+ & $\mid$ & $r_1 \cdot r_2$ & sequence\\
+ & $\mid$ & $r_1 + r_2$ & alternative / choice\\
+ & $\mid$ & $r^*$ & star (zero or more)\\
+ \end{tabular}
+\end{center}
+
+\noindent
+There are some subtleties you should be aware of. First, we will use parentheses to disambiguate
+regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$.
+The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
+$r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$,
+but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
+respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
+$r_1 + r_2$ is written as $r_1\mid{}r_2$
\end{document}
%%% Local Variables: