handouts/ho01.tex
changeset 106 93bf3182cf71
parent 105 397ecdafefd8
child 107 1bdec6a9e03d
equal deleted inserted replaced
105:397ecdafefd8 106:93bf3182cf71
    46 strings that can be used in a sentence of the English language. French would be a
    46 strings that can be used in a sentence of the English language. French would be a
    47 different set of string, and so on. In the context of this course, a language might 
    47 different set of string, and so on. In the context of this course, a language might 
    48 not necessarily make sense from a natural language perspective. For example
    48 not necessarily make sense from a natural language perspective. For example
    49 the set of all strings from above is a language, as is the empty set (of strings). The
    49 the set of all strings from above is a language, as is the empty set (of strings). The
    50 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a 
    50 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a 
    51 difference between the empty set $\{\,\}$ and the set that contains the empty string $\{\text{""}\}$.
    51 difference between the empty set, or empty language, and the set, or language, that 
       
    52 contains the empty string $\{\text{""}\}$: the former has no elements, the latter has one 
       
    53 element.
    52 
    54 
       
    55 As seen there are languages which contain infinitely many strings, like the set of all strings.
       
    56 The ``natural'' languages English, French and so on contain many but only finitely many 
       
    57 strings (the ones listed in a good dictionary). It might be therefore surprising that the
       
    58 language consisting of all email addresses is infinite if we assume it is defined by the
       
    59 regular expression\footnote{See \url{http://goo.gl/5LoVX7}}
       
    60 
       
    61 \[
       
    62 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})
       
    63 \]
       
    64 
       
    65 \noindent
       
    66 The reason is that for example before the $@$-sign there can be any string you want if it 
       
    67 is made up from letters, digits, underscore, dot and hyphen---there are infinitely many
       
    68 of those. Similarly the string after the $@$-sign can be any string. This does not mean 
       
    69 that every string is an email address. For example
       
    70 
       
    71 \[
       
    72 \text{\it foo}@\text{\it bar}.\text{\it c}
       
    73 \]
       
    74 
       
    75 \noindent
       
    76 is not, since the top-level-domains must be of length of at least two. Note that there is
       
    77 the convention that uppercase letters are treated in email-addresses as if they were
       
    78 lower-case.
       
    79 \bigskip
       
    80 
       
    81 \noindent
       
    82 \emph{Regular expressions} are meant for conveniently describing languages...at least languages
       
    83 we are interested in in Computer Science.  For example there is no convenient regular expression
       
    84 for describing the English language short of enumerating all English words/strings like in a dictionary. 
       
    85 But they seem useful for describing all permitted email addresses, as seen above. 
       
    86 
       
    87 Regular expressions are given by the following grammar:
       
    88 
       
    89 \begin{center}
       
    90 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
       
    91   $r$ & $::=$ &   $\varnothing$         & null\\
       
    92         & $\mid$ & $\epsilon$              & empty string / "" / []\\
       
    93         & $\mid$ & $c$                         & character\\
       
    94         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
       
    95         & $\mid$ & $r_1 + r_2$            & alternative / choice\\
       
    96         & $\mid$ & $r^*$                      & star (zero or more)\\
       
    97   \end{tabular}
       
    98 \end{center}
       
    99 
       
   100 \noindent
       
   101 There are some subtleties you should be aware of. First, we will use parentheses to disambiguate
       
   102 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$.
       
   103 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
       
   104 $r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$,
       
   105 but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
       
   106 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
       
   107 $r_1 + r_2$  is written as $r_1\mid{}r_2$ 
    53 \end{document}
   108 \end{document}
    54 
   109 
    55 %%% Local Variables: 
   110 %%% Local Variables: 
    56 %%% mode: latex
   111 %%% mode: latex
    57 %%% TeX-master: t
   112 %%% TeX-master: t