handouts/ho01.tex
changeset 110 9353308f1c6a
parent 108 52ee218151f9
child 111 1933e88cb73e
equal deleted inserted replaced
109:f2a90dda7e3b 110:9353308f1c6a
     9 
     9 
    10 \begin{document}
    10 \begin{document}
    11 
    11 
    12 \section*{Handout 1}
    12 \section*{Handout 1}
    13 
    13 
    14 This course is about processing of strings. Lets start with what we mean by \emph{string}. Strings
    14 This course is about the processing of strings. Lets start with what we mean by \emph{strings}. Strings
    15 are lists of characters drawn from an \emph{alphabet}. If nothing else is specified, we usually assume 
    15 (they are also sometimes referred to as \emph{words}) are lists of characters drawn from an \emph{alphabet}. 
    16 the alphabet are letters $a$, $b$, \ldots, $z$ and $A$, $B$, \ldots $Z$. Sometimes we explicitly
    16 If nothing else is specified, we usually assume 
    17 restrict strings to only contain the letters $a$ and $b$. Then we say the alphabet is the set $\{a, b\}$.
    17 the alphabet consists of just the lower-case letters $a$, $b$, \ldots, $z$. Sometimes, however, we explicitly
    18 
    18 restrict strings to contain, for example, only the letters $a$ and $b$. In this case we say the alphabet is the set $\{a, b\}$.
    19 There are many ways how we write string. Since they are lists of characters we might write
    19 
    20 them as {\it "hello"} being enclosed by double quotes. This is a short-hand for the list
    20 There are many ways how we can write down strings. In programming languages, they are usually 
       
    21 written as {\it "hello"} where the double quotes indicate that we dealing with a string. 
       
    22 Essentially, strings are lists of characters which can be written for example as follows
    21 
    23 
    22 \[
    24 \[
    23 [\text{\it h, e, l, l, o}]
    25 [\text{\it h, e, l, l, o}]
    24 \]
    26 \]
    25 
    27 
    26 \noindent
    28 \noindent
    27 The important point is that we can always decompose strings. For example we often consider the
    29 The important point is that we can always decompose strings. For example, we will often consider the
    28 first character of a string, say $h$, and the ``rest''  of a string {\it "ello"}. 
    30 first character of a string, say $h$, and the ``rest''  of a string say {\it "ello"} when making definitions 
    29 There are also some subtleties with the empty string, sometimes written as {\it ""} or as the empty list
    31 about strings. There are some subtleties with the empty string, sometimes written as {\it ""} but also as 
    30 of characters $[\,]$. Two strings, say $s_1$ and $s_2$, can be \emph{concatenated} which is written
    32 the empty list of characters $[\,]$. Two strings, for example $s_1$ and $s_2$, can be \emph{concatenated}, 
    31 as $s_1 @ s_2$. For example if we have the two strings {\it "foo"} and {\it "bar"}, their concatenation
    33 which we write as $s_1 @ s_2$. Suppose we are given two strings {\it "foo"} and {\it "bar"}, then their concatenation
    32 gives {\it "foobar"}.
    34 gives {\it "foobar"}.
    33 
    35 
    34 We often need to talk about sets of strings. For example the set of all strings
    36 We often need to talk about sets of strings. For example the set of all strings over the alphabet $\{a, \ldots\, z\}$
       
    37 is
    35 
    38 
    36 \[
    39 \[
    37 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}
    40 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}
    38 \]
    41 \]
    39 
    42 
    40 \noindent
    43 \noindent
    41 Any set of strings, not just the set of all strings, is often called a \emph{language}. The idea behind
    44 Any set of strings, not just the set-of-all-strings, is often called a \emph{language}. The idea behind
    42 this choice is that if we enumerate, say, all words/strings from a dictionary, like 
    45 this choice of terminology is that if we enumerate, say, all words/strings from a dictionary, like 
    43 
    46 
    44 \[
    47 \[
    45 \{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\}
    48 \{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\}
    46 \]
    49 \]
    47 
    50 
    48 \noindent
    51 \noindent
    49 then we have essentially described the English language, or more precisely all
    52 then we have essentially described the English language, or more precisely all
    50 strings that can be used in a sentence of the English language. French would be a
    53 strings that can be used in a sentence of the English language. French would be a
    51 different set of string, and so on. In the context of this course, a language might 
    54 different set of strings, and so on. In the context of this course, a language might 
    52 not necessarily make sense from a natural language perspective. For example
    55 not necessarily make sense from a natural language point of view. For example
    53 the set of all strings from above is a language, as is the empty set (of strings). The
    56 the set of all strings shown above is a language, as is the empty set (of strings). The
    54 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a 
    57 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a 
    55 difference between the empty set, or empty language, and the set, or language, that 
    58 difference between the empty set, or empty language, and the set that 
    56 contains the empty string $\{\text{""}\}$: the former has no elements, the latter has one 
    59 contains only the empty string $\{\text{""}\}$: the former has no elements, whereas 
    57 element.
    60 the latter has one element.
    58 
    61 
    59 As seen there are languages which contain infinitely many strings, like the set of all strings.
    62 As seen, there are languages which contain infinitely many strings, like the set of all strings.
    60 The ``natural'' languages English, French and so on contain many but only finitely many 
    63 The ``natural'' languages like English, French and so on contain many but only finitely many 
    61 strings (the ones listed in a good dictionary). It might be therefore surprising that the
    64 strings (namely the ones listed in a good dictionary). It might be therefore be surprising that the
    62 language consisting of all email addresses is infinite if we assume it is defined by the
    65 language consisting of all email addresses is infinite provided we assume it is defined by the
    63 regular expression\footnote{See \url{http://goo.gl/5LoVX7}}
    66 regular expression\footnote{See \url{http://goo.gl/5LoVX7}}
    64 
    67 
    65 \[
    68 \[
    66 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})
    69 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})
    67 \]
    70 \]
    68 
    71 
    69 \noindent
    72 \noindent
    70 The reason is that for example before the $@$-sign there can be any string you want if it 
    73 The reason is that for example before the $@$-sign there can be any string you want assuming it 
    71 is made up from letters, digits, underscore, dot and hyphen---there are infinitely many
    74 is made up from letters, digits, underscores, dots and hyphens---clearly there are infinitely many
    72 of those. Similarly the string after the $@$-sign can be any string. This does not mean 
    75 of those. Similarly the string after the $@$-sign can be any string. However, this does not mean 
    73 that every string is an email address. For example
    76 that every string is an email address. For example
    74 
    77 
    75 \[
    78 \[
    76 \text{\it foo}@\text{\it bar}.\text{\it c}
    79 \text{\it foo}@\text{\it bar}.\text{\it c}
    77 \]
    80 \]
    78 
    81 
    79 \noindent
    82 \noindent
    80 is not, since the top-level-domains must be of length of at least two. Note that there is
    83 is not, because the top-level-domains must be of length of at least two. (Note that there is
    81 the convention that uppercase letters are treated in email-addresses as if they were
    84 the convention that uppercase letters are treated in email-addresses as if they were
    82 lower-case.
    85 lower-case.)
    83 \bigskip
    86 \bigskip
    84 
    87 
    85 Before we expand on the topic of regular expressions, let us review some operations on
    88 Before we expand on the topic of regular expressions, let us review some operations on
    86 sets. We will use capital letters, such as $A$, $B$ and so on, to stand for sets of strings. 
    89 sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings. 
    87 The union of two sets is written as usual as $A \cup B$. We also need to define the
    90 The union of two sets is written as usual as $A \cup B$. We also need to define the
    88 \emph{concatenation} of two sets (of strings). This can be defined as
    91 operation of \emph{concatenating} two sets of strings. This can be defined as
    89 
    92 
    90 \[
    93 \[
    91 A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}
    94 A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}
    92 \]
    95 \]
    93 
    96 
    94 \noindent
    97 \noindent
    95 which essentially means take the first string from the set $A$ and concatenate it with every
    98 which essentially means take the first string from the set $A$ and concatenate it with every
    96 string in the set $B$, then take the second string from $A$ and so on. We also need to define
    99 string in the set $B$, then take the second string from $A$ do the same and so on. You might
    97 the power of a set, written as $A^n$. This is defined inductively as follows
   100 like to think about what this definition means in case $A$ or $B$ is the empty set.
       
   101 
       
   102 We also need to define
       
   103 the power of a set, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows
    98 
   104 
    99 \begin{center}
   105 \begin{center}
   100 \begin{tabular}{rcl}
   106 \begin{tabular}{rcl}
   101 $A^0$ & $\dn$ & $\{[\,]\}$ \\
   107 $A^0$ & $\dn$ & $\{[\,]\}$ \\
   102 $A^{n+1}$ & $\dn$ & $A @ A^n$\\
   108 $A^{n+1}$ & $\dn$ & $A @ A^n$\\
   103 \end{tabular}
   109 \end{tabular}
   104 \end{center}
   110 \end{center}
   105 
   111 
   106 \noindent
   112 \noindent
   107 Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union
   113 Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union
   108 of every power of $A^n$ for every $n\ge 0$. The mathematical notation for this operation is
   114 of every power of $A^n$ with $n\ge 0$. The mathematical notation for this operation is
   109 
   115 
   110 \[
   116 \[
   111 A^* \dn \bigcup_{0\le n} A^n
   117 A^* \dn \bigcup_{0\le n} A^n
   112 \]
   118 \]
   113 
   119 
   114 \noindent
   120 \noindent
   115 This means the star of a set $A$ contains always the empty string, one copy of every string in $A$, two
   121 This definition implies that the star of a set $A$ contains always the empty string (that is $A^0$), one 
   116 copies and so on. In case $A=\{"a\}$ we have 
   122 copy of every string in $A$ (that is $A^1$), two copies in $A$ (that is $A^2$) and so on. In case $A=\{"a"\}$ we therefore 
       
   123 have 
   117 
   124 
   118 \[
   125 \[
   119 A^* = \{"", "a", "aa", "aaa", \ldots\}
   126 A^* = \{"", "a", "aa", "aaa", \ldots\}
   120 \]
   127 \]
   121 
   128 
   122 \noindent
   129 \noindent
   123 Be aware that these operations sometimes have non-intuitive properties, for example
   130 Be aware that these operations sometimes have quite non-intuitive properties, for example
   124 
   131 
   125 \begin{center}
   132 \begin{center}
   126 \begin{tabular}{ccc}
   133 \begin{tabular}{ccc}
   127 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
   134 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
   128 $A \cup \varnothing$ & $=$ & $A$\\
   135 $A \cup \varnothing$ & $=$ & $A$\\
   135 $A  @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
   142 $A  @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
   136 \end{tabular} &
   143 \end{tabular} &
   137 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
   144 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
   138 $\varnothing^*$ & $=$ & $\{""\}$\\
   145 $\varnothing^*$ & $=$ & $\{""\}$\\
   139 $\{""\}^*$ & $=$ & $\{""\}$\\
   146 $\{""\}^*$ & $=$ & $\{""\}$\\
   140 $A^1$ & $=$ & $A$\\
   147 $A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\
   141 \end{tabular} 
   148 \end{tabular} 
   142 \end{tabular}
   149 \end{tabular}
   143 \end{center}
   150 \end{center}
   144 \bigskip
   151 \bigskip
   145 
   152 
   146 \noindent
   153 \noindent
   147 \emph{Regular expressions} are meant for conveniently describing languages...at least languages
   154 \emph{Regular expressions} are meant to conveniently describe languages...at least languages
   148 we are interested in in Computer Science.  For example there is no convenient regular expression
   155 we are interested in in Computer Science.  For example there is no convenient regular expression
   149 for describing the English language short of enumerating all English words/strings like in a dictionary. 
   156 for describing the English language short of enumerating all English words. 
   150 But they seem useful for describing all permitted email addresses, as seen above. 
   157 But they seem useful for describing all permitted email addresses, as seen above. 
   151 
   158 
   152 Regular expressions are given by the following grammar:
   159 Regular expressions are given by the following grammar:
   153 
   160 
   154 \begin{center}
   161 \begin{center}
   155 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
   162 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
   156   $r$ & $::=$ &   $\varnothing$         & null\\
   163   $r$ & $::=$ &   $\varnothing$         & null\\
   157         & $\mid$ & $\epsilon$              & empty string / "" / []\\
   164         & $\mid$ & $\epsilon$              & empty string / "" / []\\
   158         & $\mid$ & $c$                         & character\\
   165         & $\mid$ & $c$                         & single character\\
   159         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
   166         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
   160         & $\mid$ & $r_1 + r_2$            & alternative / choice\\
   167         & $\mid$ & $r_1 + r_2$            & alternative / choice\\
   161         & $\mid$ & $r^*$                      & star (zero or more)\\
   168         & $\mid$ & $r^*$                      & star (zero or more)\\
   162   \end{tabular}
   169   \end{tabular}
   163 \end{center}
   170 \end{center}
   164 
   171 
   165 \noindent
   172 \noindent
   166 There are some subtleties you should be aware of. The letter $c$ stands for any character from the
   173 Because we overload our notation there are some subtleties you should be aware of. The letter $c$ stands for any character from the
   167 alphabet. Second, we will use parentheses to disambiguate
   174 alphabet at hand. Second, we will use parentheses to disambiguate
   168 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$.
   175 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$.
   169 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
   176 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
   170 $r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$,
   177 $r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$,
   171 but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
   178 but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
   172 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
   179 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
   173 $r_1 + r_2$  is written as $r_1\mid{}r_2$. In case of $\cdot$ we will even often omit it all together. For example
   180 $r_1 + r_2$  is written as $r_1\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even 
   174 the regular expression for email addresses is meant to be of the form
   181 often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form
   175 
   182 
   176 \[
   183 \[
   177 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots
   184 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots
   178 \]
   185 \]
   179 
   186 
   187 \]
   194 \]
   188 
   195 
   189 \noindent
   196 \noindent
   190 but often just write {\it hello}.
   197 but often just write {\it hello}.
   191 
   198 
   192 Another source of confusion might be that we use the term \emph{regular expressions} for the ones used in ``theory''
   199 Another source of confusion might arise from the fact that we use the term \emph{regular expressions} for the ones used in ``theory''
   193 and the ones in ``practice''. In this course by default we refer to the regular expressions defined by the grammar above. 
   200 and also the ones in ``practice''. In this course we refer by default to the regular expressions defined by the grammar above. 
   194 In ``practice'' we often use $r^+$ for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ for an optional regular
   201 In ``practice'' we often use $r^+$ to stand for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ to stand for an optional regular
   195 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are mere convenience 
   202 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are however mere convenience 
   196 as they can be seen as shorthand for
   203 as they can be seen as shorthand for
   197 
   204 
   198 \begin{center}
   205 \begin{center}
   199 \begin{tabular}{rcl}
   206 \begin{tabular}{rcl}
   200 $r^+$ & $\mapsto$ & $r\cdot r^*$\\
   207 $r^+$ & $\mapsto$ & $r\cdot r^*$\\
   202 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
   209 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
   203 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
   210 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
   204 \end{tabular}
   211 \end{tabular}
   205 \end{center}
   212 \end{center}
   206 
   213 
   207 \noindent
   214 
   208 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular
   215 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular
   209 expression is supposed to stand for every string \emph{not} match by a regular expression. We will write
   216 expression is supposed to stand for every string \emph{not} matched by a regular expression. We will write
   210 such regular expressions as $~r$. While being ``convenience'' it is often not so clear what the shorthand for
   217 such not-regular-expressions as $\sim{}r$. While being ``convenience'' it is often not so clear what the shorthand for
   211 these kind of regular expressions is.
   218 these kind of not-regular-expressions is. Try to write down the regular expression which can match any
       
   219 string except {\it "hello"} and {\it "world"}. It is possible in principle, but often it is easier to just include
       
   220 $\sim{}r$ in the definition or regular expressions. Whenever we do so, we will state it explicitly.
   212 
   221 
   213 So far we have only considered informally what the \emph{meaning} of a regular expression is.  
   222 So far we have only considered informally what the \emph{meaning} of a regular expression is.  
   214 Formally we associate with every regular expression a set of strings which are matched by this
   223 Formally, we associate with every regular expression a set of strings which are matched by this
   215 regular expression. This can be formally defined as 
   224 regular expression. This can be formally defined as 
   216 
   225 
   217 \begin{center}
   226 \begin{center}
   218 \begin{tabular}{rcl}
   227 \begin{tabular}{rcl}
   219 $L(\varnothing)$  & $\dn$ & $\{\,\}$\\
   228 $L(\varnothing)$  & $\dn$ & $\{\,\}$\\