handouts/ho01.tex
changeset 108 52ee218151f9
parent 107 1bdec6a9e03d
child 110 9353308f1c6a
equal deleted inserted replaced
107:1bdec6a9e03d 108:52ee218151f9
     3 \usepackage{hyperref}
     3 \usepackage{hyperref}
     4 \usepackage{amssymb}
     4 \usepackage{amssymb}
     5 \usepackage{amsmath}
     5 \usepackage{amsmath}
     6 \usepackage[T1]{fontenc}
     6 \usepackage[T1]{fontenc}
     7 
     7 
       
     8 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
       
     9 
     8 \begin{document}
    10 \begin{document}
     9 
    11 
    10 \section*{Handout 1}
    12 \section*{Handout 1}
    11 
    13 
    12 This course is about processing of strings. Lets start with what we mean by \emph{string}. Strings
    14 This course is about processing of strings. Lets start with what we mean by \emph{string}. Strings
    23 
    25 
    24 \noindent
    26 \noindent
    25 The important point is that we can always decompose strings. For example we often consider the
    27 The important point is that we can always decompose strings. For example we often consider the
    26 first character of a string, say $h$, and the ``rest''  of a string {\it "ello"}. 
    28 first character of a string, say $h$, and the ``rest''  of a string {\it "ello"}. 
    27 There are also some subtleties with the empty string, sometimes written as {\it ""} or as the empty list
    29 There are also some subtleties with the empty string, sometimes written as {\it ""} or as the empty list
    28 of characters $[\,]$. 
    30 of characters $[\,]$. Two strings, say $s_1$ and $s_2$, can be \emph{concatenated} which is written
       
    31 as $s_1 @ s_2$. For example if we have the two strings {\it "foo"} and {\it "bar"}, their concatenation
       
    32 gives {\it "foobar"}.
    29 
    33 
    30 We often need to talk about sets of strings. For example the set of all strings
    34 We often need to talk about sets of strings. For example the set of all strings
    31 
    35 
    32 \[
    36 \[
    33 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}
    37 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}
    76 is not, since the top-level-domains must be of length of at least two. Note that there is
    80 is not, since the top-level-domains must be of length of at least two. Note that there is
    77 the convention that uppercase letters are treated in email-addresses as if they were
    81 the convention that uppercase letters are treated in email-addresses as if they were
    78 lower-case.
    82 lower-case.
    79 \bigskip
    83 \bigskip
    80 
    84 
       
    85 Before we expand on the topic of regular expressions, let us review some operations on
       
    86 sets. We will use capital letters, such as $A$, $B$ and so on, to stand for sets of strings. 
       
    87 The union of two sets is written as usual as $A \cup B$. We also need to define the
       
    88 \emph{concatenation} of two sets (of strings). This can be defined as
       
    89 
       
    90 \[
       
    91 A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}
       
    92 \]
       
    93 
       
    94 \noindent
       
    95 which essentially means take the first string from the set $A$ and concatenate it with every
       
    96 string in the set $B$, then take the second string from $A$ and so on. We also need to define
       
    97 the power of a set, written as $A^n$. This is defined inductively as follows
       
    98 
       
    99 \begin{center}
       
   100 \begin{tabular}{rcl}
       
   101 $A^0$ & $\dn$ & $\{[\,]\}$ \\
       
   102 $A^{n+1}$ & $\dn$ & $A @ A^n$\\
       
   103 \end{tabular}
       
   104 \end{center}
       
   105 
       
   106 \noindent
       
   107 Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union
       
   108 of every power of $A^n$ for every $n\ge 0$. The mathematical notation for this operation is
       
   109 
       
   110 \[
       
   111 A^* \dn \bigcup_{0\le n} A^n
       
   112 \]
       
   113 
       
   114 \noindent
       
   115 This means the star of a set $A$ contains always the empty string, one copy of every string in $A$, two
       
   116 copies and so on. In case $A=\{"a\}$ we have 
       
   117 
       
   118 \[
       
   119 A^* = \{"", "a", "aa", "aaa", \ldots\}
       
   120 \]
       
   121 
       
   122 \noindent
       
   123 Be aware that these operations sometimes have non-intuitive properties, for example
       
   124 
       
   125 \begin{center}
       
   126 \begin{tabular}{ccc}
       
   127 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   128 $A \cup \varnothing$ & $=$ & $A$\\
       
   129 $A \cup A$ & $=$ & $A$\\
       
   130 $A \cup B$ & $=$ & $B \cup A$\\
       
   131 \end{tabular} &
       
   132 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   133 $A @ B$ & $\not =$ & $B @ A$\\
       
   134 $A  @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\
       
   135 $A  @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
       
   136 \end{tabular} &
       
   137 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   138 $\varnothing^*$ & $=$ & $\{""\}$\\
       
   139 $\{""\}^*$ & $=$ & $\{""\}$\\
       
   140 $A^1$ & $=$ & $A$\\
       
   141 \end{tabular} 
       
   142 \end{tabular}
       
   143 \end{center}
       
   144 \bigskip
       
   145 
    81 \noindent
   146 \noindent
    82 \emph{Regular expressions} are meant for conveniently describing languages...at least languages
   147 \emph{Regular expressions} are meant for conveniently describing languages...at least languages
    83 we are interested in in Computer Science.  For example there is no convenient regular expression
   148 we are interested in in Computer Science.  For example there is no convenient regular expression
    84 for describing the English language short of enumerating all English words/strings like in a dictionary. 
   149 for describing the English language short of enumerating all English words/strings like in a dictionary. 
    85 But they seem useful for describing all permitted email addresses, as seen above. 
   150 But they seem useful for describing all permitted email addresses, as seen above. 
    96         & $\mid$ & $r^*$                      & star (zero or more)\\
   161         & $\mid$ & $r^*$                      & star (zero or more)\\
    97   \end{tabular}
   162   \end{tabular}
    98 \end{center}
   163 \end{center}
    99 
   164 
   100 \noindent
   165 \noindent
   101 There are some subtleties you should be aware of. First, we will use parentheses to disambiguate
   166 There are some subtleties you should be aware of. The letter $c$ stands for any character from the
       
   167 alphabet. Second, we will use parentheses to disambiguate
   102 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$.
   168 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$.
   103 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
   169 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
   104 $r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$,
   170 $r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$,
   105 but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
   171 but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
   106 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
   172 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
   111 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots
   177 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots
   112 \]
   178 \]
   113 
   179 
   114 \noindent
   180 \noindent
   115 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then
   181 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then
   116 a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain.
   182 a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain. Similarly if
   117 
   183 we want to specify the regular expression for the string {\it "hello"} we should write
   118 
   184 
       
   185 \[
       
   186 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}
       
   187 \]
       
   188 
       
   189 \noindent
       
   190 but often just write {\it hello}.
       
   191 
       
   192 Another source of confusion might be that we use the term \emph{regular expressions} for the ones used in ``theory''
       
   193 and the ones in ``practice''. In this course by default we refer to the regular expressions defined by the grammar above. 
       
   194 In ``practice'' we often use $r^+$ for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ for an optional regular
       
   195 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are mere convenience 
       
   196 as they can be seen as shorthand for
       
   197 
       
   198 \begin{center}
       
   199 \begin{tabular}{rcl}
       
   200 $r^+$ & $\mapsto$ & $r\cdot r^*$\\
       
   201 $r^?$ & $\mapsto$ & $\epsilon + r$\\
       
   202 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
       
   203 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
       
   204 \end{tabular}
       
   205 \end{center}
       
   206 
       
   207 \noindent
       
   208 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular
       
   209 expression is supposed to stand for every string \emph{not} match by a regular expression. We will write
       
   210 such regular expressions as $~r$. While being ``convenience'' it is often not so clear what the shorthand for
       
   211 these kind of regular expressions is.
       
   212 
       
   213 So far we have only considered informally what the \emph{meaning} of a regular expression is.  
       
   214 Formally we associate with every regular expression a set of strings which are matched by this
       
   215 regular expression. This can be formally defined as 
       
   216 
       
   217 \begin{center}
       
   218 \begin{tabular}{rcl}
       
   219 $L(\varnothing)$  & $\dn$ & $\{\,\}$\\
       
   220 $L(\epsilon)$       & $\dn$ & $\{""\}$\\
       
   221 $L(c)$                  & $\dn$ & $\{"c"\}$\\
       
   222 $L(r_1+ r_2)$      & $\dn$ & $L(r_1) \cup L(r_2)$\\
       
   223 $L(r_1 \cdot r_2)$  & $\dn$ & $\{s_1@ s_2 | s_1 \in L(r_1) \wedge s_2 \in L(r_2) \}$\\
       
   224 $L(r^*)$                   & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\
       
   225 \end{tabular}
       
   226 \end{center}
       
   227 
       
   228 \noindent
       
   229 This means we can now precisely state what the meaning, for example, of the regular expression 
       
   230 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely 
       
   231 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$. Similarly if we have the choice
       
   232 $a + b$, the meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
       
   233 be matched by this choice. 
       
   234 
       
   235 The point of this definition is that we can now precisely specify when a string is matched by a
       
   236 regular expression, namely a string, say $s$, is matched by a regular expression, say $r$, if
       
   237 and only if $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
       
   238 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
       
   239 if $s \not\in L(r)$. We leave this for the next lecture.
   119 \end{document}
   240 \end{document}
   120 
   241 
   121 %%% Local Variables: 
   242 %%% Local Variables: 
   122 %%% mode: latex
   243 %%% mode: latex
   123 %%% TeX-master: t
   244 %%% TeX-master: t