handouts/ho01.tex
changeset 123 a75f9c9d8f94
parent 113 db6862f6bf6c
child 217 cd6066f1056a
equal deleted inserted replaced
122:4123344e6915 123:a75f9c9d8f94
   106 \[
   106 \[
   107 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})
   107 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})
   108 \]
   108 \]
   109 
   109 
   110 \noindent
   110 \noindent
   111 The reason is that for example before the $@$-sign there can be any string you want assuming it 
   111 One reason is that before the $@$-sign there can be any string you want assuming it 
   112 is made up from letters, digits, underscores, dots and hyphens---clearly there are infinitely many
   112 is made up from letters, digits, underscores, dots and hyphens---clearly there are infinitely many
   113 of those. Similarly the string after the $@$-sign can be any string. However, this does not mean 
   113 of those. Similarly the string after the $@$-sign can be any string. However, this does not mean 
   114 that every string is an email address. For example
   114 that every string is an email address. For example
   115 
   115 
   116 \[
   116 \[
   117 \text{\it foo}@\text{\it bar}.\text{\it c}
   117 "\text{\it foo}@\text{\it bar}.\text{\it c}"
   118 \]
   118 \]
   119 
   119 
   120 \noindent
   120 \noindent
   121 is not, because the top-level-domains must be of length of at least two. (Note that there is
   121 is not, because the top-level-domains must be of length of at least two. (Note that there is
   122 the convention that uppercase letters are treated in email-addresses as if they were
   122 the convention that uppercase letters are treated in email-addresses as if they were
   136 which essentially means take the first string from the set $A$ and concatenate it with every
   136 which essentially means take the first string from the set $A$ and concatenate it with every
   137 string in the set $B$, then take the second string from $A$ do the same and so on. You might
   137 string in the set $B$, then take the second string from $A$ do the same and so on. You might
   138 like to think about what this definition means in case $A$ or $B$ is the empty set.
   138 like to think about what this definition means in case $A$ or $B$ is the empty set.
   139 
   139 
   140 We also need to define
   140 We also need to define
   141 the power of a set, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows
   141 the power of a set of strings, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows
   142 
   142 
   143 \begin{center}
   143 \begin{center}
   144 \begin{tabular}{rcl}
   144 \begin{tabular}{rcl}
   145 $A^0$ & $\dn$ & $\{[\,]\}$ \\
   145 $A^0$ & $\dn$ & $\{[\,]\}$ \\
   146 $A^{n+1}$ & $\dn$ & $A @ A^n$\\
   146 $A^{n+1}$ & $\dn$ & $A @ A^n$\\
   166 
   166 
   167 \noindent
   167 \noindent
   168 Be aware that these operations sometimes have quite non-intuitive properties, for example
   168 Be aware that these operations sometimes have quite non-intuitive properties, for example
   169 
   169 
   170 \begin{center}
   170 \begin{center}
   171 \begin{tabular}{ccc}
   171 \begin{tabular}{@{}ccc@{}}
   172 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
   172 \begin{tabular}{@{}r@{\hspace{1mm}}c@{\hspace{1mm}}l}
   173 $A \cup \varnothing$ & $=$ & $A$\\
   173 $A \cup \varnothing$ & $=$ & $A$\\
   174 $A \cup A$ & $=$ & $A$\\
   174 $A \cup A$ & $=$ & $A$\\
   175 $A \cup B$ & $=$ & $B \cup A$\\
   175 $A \cup B$ & $=$ & $B \cup A$\\
   176 \end{tabular} &
   176 \end{tabular} &
   177 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
   177 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
   178 $A @ B$ & $\not =$ & $B @ A$\\
   178 $A @ B$ & $\not =$ & $B @ A$\\
   179 $A  @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\
   179 $A  @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\
   180 $A  @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
   180 $A  @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
   181 \end{tabular} &
   181 \end{tabular} &
   182 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
   182 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
   183 $\varnothing^*$ & $=$ & $\{""\}$\\
   183 $\varnothing^*$ & $=$ & $\{""\}$\\
   184 $\{""\}^*$ & $=$ & $\{""\}$\\
   184 $\{""\}^*$ & $=$ & $\{""\}$\\
   185 $A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\
   185 $A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\
   186 \end{tabular} 
   186 \end{tabular} 
   187 \end{tabular}
   187 \end{tabular}
   206         & $\mid$ & $r^*$                      & star (zero or more)\\
   206         & $\mid$ & $r^*$                      & star (zero or more)\\
   207   \end{tabular}
   207   \end{tabular}
   208 \end{center}
   208 \end{center}
   209 
   209 
   210 \noindent
   210 \noindent
   211 Because we overload our notation there are some subtleties you should be aware of. The letter $c$ stands for any character from the
   211 Because we overload our notation, there are some subtleties you should be aware of. First, the letter 
       
   212 $c$ stands for any character from the
   212 alphabet at hand. Second, we will use parentheses to disambiguate
   213 alphabet at hand. Second, we will use parentheses to disambiguate
   213 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$.
   214 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$.
   214 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
   215 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
   215 $r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$,
   216 $r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$,
   216 but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
   217 but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
   217 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
   218 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
   218 $r_1 + r_2$  is written as $r_1\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even 
   219 $r_1 + r_2$  is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even 
   219 often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form
   220 often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form
   220 
   221 
   221 \[
   222 \[
   222 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots
   223 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots
   223 \]
   224 \]
   224 
   225 
   225 \noindent
   226 \noindent
   226 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then
   227 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then
   227 a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain. Similarly if
   228 a domain name (specified by the regular expression $([\ldots])^+$), then a dot and then a top-level domain. Similarly if
   228 we want to specify the regular expression for the string {\it "hello"} we should write
   229 we want to specify the regular expression for the string {\it "hello"} we should write
   229 
   230 
   230 \[
   231 \[
   231 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}
   232 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}
   232 \]
   233 \]
   233 
   234 
   234 \noindent
   235 \noindent
   235 but often just write {\it hello}.
   236 but often just write {\it hello}.
   236 
   237 
   237 Another source of confusion might arise from the fact that we use the term \emph{regular expressions} for the ones used in ``theory''
   238 Another source of confusion might arise from the fact that we use the term \emph{regular expression} for the regular expressions used in ``theory''
   238 and also the ones in ``practice''. In this course we refer by default to the regular expressions defined by the grammar above. 
   239 and also the ones used in ``practice''. In this course we refer by default to the regular expressions defined by the grammar above. 
   239 In ``practice'' we often use $r^+$ to stand for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ to stand for an optional regular
   240 In ``practice'' we often use $r^+$ to stand for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ to stand for an optional regular
   240 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are however mere convenience 
   241 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are however mere convenience 
   241 as they can be seen as shorthand for
   242 as they can be seen as shorthand for
   242 
   243 
   243 \begin{center}
   244 \begin{center}
   252 
   253 
   253 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular
   254 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular
   254 expression is supposed to stand for every string \emph{not} matched by a regular expression. We will write
   255 expression is supposed to stand for every string \emph{not} matched by a regular expression. We will write
   255 such not-regular-expressions as $\sim{}r$. While being ``convenience'' it is often not so clear what the shorthand for
   256 such not-regular-expressions as $\sim{}r$. While being ``convenience'' it is often not so clear what the shorthand for
   256 these kind of not-regular-expressions is. Try to write down the regular expression which can match any
   257 these kind of not-regular-expressions is. Try to write down the regular expression which can match any
   257 string except {\it "hello"} and {\it "world"}. It is possible in principle, but often it is easier to just include
   258 string except the two strings {\it "hello"} and {\it "world"}. It is possible in principle, but often it is easier to just include
   258 $\sim{}r$ in the definition or regular expressions. Whenever we do so, we will state it explicitly.
   259 $\sim{}r$ in the definition of regular expressions. Whenever we do so, we will state it explicitly.
   259 
   260 
   260 So far we have only considered informally what the \emph{meaning} of a regular expression is.  
   261 So far we have only considered informally what the \emph{meaning} of a regular expression is.  
   261 To do so more formally we will associate with every regular expression a set of strings 
   262 To do so more formally we will associate with every regular expression a set of strings 
   262 that is supposed to be  are matched by this
   263 that is supposed to be matched by this
   263 regular expression. This can be defined recursively  as follows
   264 regular expression. This can be defined recursively  as follows
   264 
   265 
   265 \begin{center}
   266 \begin{center}
   266 \begin{tabular}{rcl}
   267 \begin{tabular}{rcl}
   267 $L(\varnothing)$  & $\dn$ & $\{\,\}$\\
   268 $L(\varnothing)$  & $\dn$ & $\{\,\}$\\
   272 $L(r^*)$                   & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\
   273 $L(r^*)$                   & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\
   273 \end{tabular}
   274 \end{tabular}
   274 \end{center}
   275 \end{center}
   275 
   276 
   276 \noindent
   277 \noindent
   277 This means we can now precisely state what the meaning, for example, of the regular expression 
   278 As a result we can now precisely state what the meaning, for example, of the regular expression 
   278 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely 
   279 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely 
   279 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the 
   280 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the 
   280 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
   281 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
   281 be matched by this choice. You can now also conclude why we do not make a difference
   282 be matched by this choice. You can now also see why we do not make a difference
   282 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they 
   283 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they 
   283 are not the same regular expression, but have the same meaning. 
   284 are not the same regular expression, but have the same meaning. 
   284 
   285 
   285 The point of the definition of $L$ is that we can now precisely specify when a string $s$ is matched by a
   286 The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a
   286 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
   287 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
   287 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
   288 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
   288 if $s \not\in L(r)$. We leave this for the next lecture.
   289 if $s \not\in L(r)$. We leave this for the next lecture.
   289 
   290 
   290 \begin{figure}[p]
   291 \begin{figure}[p]