afl-material: comparison handouts/ho01.tex

equal deleted inserted replaced

-:397ecdafefd8
+:93bf3182cf71
 strings that can be used in a sentence of the English language. French would be a
 different set of string, and so on. In the context of this course, a language might
 not necessarily make sense from a natural language perspective. For example
 the set of all strings from above is a language, as is the empty set (of strings). The
 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a
-difference between the empty set $\{\,\}$ and the set that contains the empty string $\{\text{""}\}$.
+difference between the empty set, or empty language, and the set, or language, that
+contains the empty string $\{\text{""}\}$: the former has no elements, the latter has one
+element.
+As seen there are languages which contain infinitely many strings, like the set of all strings.
+The ``natural'' languages English, French and so on contain many but only finitely many
+strings (the ones listed in a good dictionary). It might be therefore surprising that the
+language consisting of all email addresses is infinite if we assume it is defined by the
+regular expression\footnote{See \url{http://goo.gl/5LoVX7}}
+\[
+([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})
+\]
+\noindent
+The reason is that for example before the $@$-sign there can be any string you want if it
+is made up from letters, digits, underscore, dot and hyphen---there are infinitely many
+of those. Similarly the string after the $@$-sign can be any string. This does not mean
+that every string is an email address. For example
+\[
+\text{\it foo}@\text{\it bar}.\text{\it c}
+\]
+\noindent
+is not, since the top-level-domains must be of length of at least two. Note that there is
+the convention that uppercase letters are treated in email-addresses as if they were
+lower-case.
+\bigskip
+\noindent
+\emph{Regular expressions} are meant for conveniently describing languages...at least languages
+we are interested in in Computer Science.  For example there is no convenient regular expression
+for describing the English language short of enumerating all English words/strings like in a dictionary.
+But they seem useful for describing all permitted email addresses, as seen above.
+Regular expressions are given by the following grammar:
+\begin{center}
+\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
+$r$ & $::=$ &   $\varnothing$         & null\\
+& $\mid$ & $\epsilon$              & empty string / "" / []\\
+& $\mid$ & $c$                         & character\\
+& $\mid$ & $r_1 \cdot r_2$      & sequence\\
+& $\mid$ & $r_1 + r_2$            & alternative / choice\\
+& $\mid$ & $r^*$                      & star (zero or more)\\
+\end{tabular}
+\end{center}
+\noindent
+There are some subtleties you should be aware of. First, we will use parentheses to disambiguate
+regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$.
+The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
+$r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$,
+but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
+respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
+$r_1 + r_2$  is written as $r_1\mid{}r_2$
 \end{document}
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: t

changeset 106	93bf3182cf71
parent 105	397ecdafefd8
child 107	1bdec6a9e03d