diff -r 397ecdafefd8 -r 93bf3182cf71 handouts/ho01.tex --- a/handouts/ho01.tex Thu Sep 26 11:50:31 2013 +0100 +++ b/handouts/ho01.tex Thu Sep 26 13:06:27 2013 +0100 @@ -48,8 +48,63 @@ not necessarily make sense from a natural language perspective. For example the set of all strings from above is a language, as is the empty set (of strings). The empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a -difference between the empty set $\{\,\}$ and the set that contains the empty string $\{\text{""}\}$. +difference between the empty set, or empty language, and the set, or language, that +contains the empty string $\{\text{""}\}$: the former has no elements, the latter has one +element. + +As seen there are languages which contain infinitely many strings, like the set of all strings. +The ``natural'' languages English, French and so on contain many but only finitely many +strings (the ones listed in a good dictionary). It might be therefore surprising that the +language consisting of all email addresses is infinite if we assume it is defined by the +regular expression\footnote{See \url{http://goo.gl/5LoVX7}} + +\[ +([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}}) +\] + +\noindent +The reason is that for example before the $@$-sign there can be any string you want if it +is made up from letters, digits, underscore, dot and hyphen---there are infinitely many +of those. Similarly the string after the $@$-sign can be any string. This does not mean +that every string is an email address. For example + +\[ +\text{\it foo}@\text{\it bar}.\text{\it c} +\] +\noindent +is not, since the top-level-domains must be of length of at least two. Note that there is +the convention that uppercase letters are treated in email-addresses as if they were +lower-case. +\bigskip + +\noindent +\emph{Regular expressions} are meant for conveniently describing languages...at least languages +we are interested in in Computer Science. For example there is no convenient regular expression +for describing the English language short of enumerating all English words/strings like in a dictionary. +But they seem useful for describing all permitted email addresses, as seen above. + +Regular expressions are given by the following grammar: + +\begin{center} +\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} + $r$ & $::=$ & $\varnothing$ & null\\ + & $\mid$ & $\epsilon$ & empty string / "" / []\\ + & $\mid$ & $c$ & character\\ + & $\mid$ & $r_1 \cdot r_2$ & sequence\\ + & $\mid$ & $r_1 + r_2$ & alternative / choice\\ + & $\mid$ & $r^*$ & star (zero or more)\\ + \end{tabular} +\end{center} + +\noindent +There are some subtleties you should be aware of. First, we will use parentheses to disambiguate +regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$. +The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times +$r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$, +but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, +respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice +$r_1 + r_2$ is written as $r_1\mid{}r_2$ \end{document} %%% Local Variables: