handouts/ho01.tex
changeset 106 93bf3182cf71
parent 105 397ecdafefd8
child 107 1bdec6a9e03d
--- a/handouts/ho01.tex	Thu Sep 26 11:50:31 2013 +0100
+++ b/handouts/ho01.tex	Thu Sep 26 13:06:27 2013 +0100
@@ -48,8 +48,63 @@
 not necessarily make sense from a natural language perspective. For example
 the set of all strings from above is a language, as is the empty set (of strings). The
 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a 
-difference between the empty set $\{\,\}$ and the set that contains the empty string $\{\text{""}\}$.
+difference between the empty set, or empty language, and the set, or language, that 
+contains the empty string $\{\text{""}\}$: the former has no elements, the latter has one 
+element.
+
+As seen there are languages which contain infinitely many strings, like the set of all strings.
+The ``natural'' languages English, French and so on contain many but only finitely many 
+strings (the ones listed in a good dictionary). It might be therefore surprising that the
+language consisting of all email addresses is infinite if we assume it is defined by the
+regular expression\footnote{See \url{http://goo.gl/5LoVX7}}
+
+\[
+([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})
+\]
+
+\noindent
+The reason is that for example before the $@$-sign there can be any string you want if it 
+is made up from letters, digits, underscore, dot and hyphen---there are infinitely many
+of those. Similarly the string after the $@$-sign can be any string. This does not mean 
+that every string is an email address. For example
+
+\[
+\text{\it foo}@\text{\it bar}.\text{\it c}
+\]
 
+\noindent
+is not, since the top-level-domains must be of length of at least two. Note that there is
+the convention that uppercase letters are treated in email-addresses as if they were
+lower-case.
+\bigskip
+
+\noindent
+\emph{Regular expressions} are meant for conveniently describing languages...at least languages
+we are interested in in Computer Science.  For example there is no convenient regular expression
+for describing the English language short of enumerating all English words/strings like in a dictionary. 
+But they seem useful for describing all permitted email addresses, as seen above. 
+
+Regular expressions are given by the following grammar:
+
+\begin{center}
+\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
+  $r$ & $::=$ &   $\varnothing$         & null\\
+        & $\mid$ & $\epsilon$              & empty string / "" / []\\
+        & $\mid$ & $c$                         & character\\
+        & $\mid$ & $r_1 \cdot r_2$      & sequence\\
+        & $\mid$ & $r_1 + r_2$            & alternative / choice\\
+        & $\mid$ & $r^*$                      & star (zero or more)\\
+  \end{tabular}
+\end{center}
+
+\noindent
+There are some subtleties you should be aware of. First, we will use parentheses to disambiguate
+regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$.
+The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
+$r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$,
+but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
+respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
+$r_1 + r_2$  is written as $r_1\mid{}r_2$ 
 \end{document}
 
 %%% Local Variables: